Machine Learning Tutorial

Author

Kayla McComb

Introduction to Machine Learning

Machine learning is the process of building predictive models. We encounter applications for machine learning each day. Machine learning algorithms are used to make critical decisions in medical diagnosis, give song or movie recommendations, or provide companies with insight into customers’ purchasing behavior.

Rob’s definition of machine learning:

Machine learning is a framework for using different classes of algorithms to train models and test their predictive accuracy from data.

  • Framework: a general workflow for data handling and algorithm selection​

  • Train models: building mathematical equations that can be applied to other data​

  • Predictive accuracy: assessing how well your models perform on new data​

This tutorial will:

  1. Present a basic understanding of machine learning.
  2. Provide examples of how to use the caret package to carry out regression and classification modeling.
  3. Provide one example of unsupervised learning using k-means clustering.

Supervised vs. Unsupervised Learning

There are two main types of machine learning algorithms: supervised and unsupervised

Supervised learning models are those where the machine learning model we build is based off of a known quantity. In this case, we already know the “correct answers” and we train the algorithm to find patterns in our data in order to reach the best performance in predicting a known quanity. We can then apply these models to never-before-seen data to make predictions. Examples include:

  • Classification Models: Making categorical predictions such as predicting whether a tumor is benign or malignant or whether an email is spam or not.

  • Regression Models: Making continuous predictions such as predicting changes in temperature or predicting someone’s weight.

Unsupervised learning models are those where the machine learning model derives patterns and information from data while determining the known quantity itself. In this case, there are no known “correct answers” but rather the goal is to find the underlying structure or distribution in the data in order to learn more about the data. Examples include:

  • Clustering Models: Finding hidden patterns of inherent groupings in data such as grouping customers by purchasing behavior.

  • Association Models: Findings rules that describe large portions of your data, such as “people that buy X also tend to buy Y”.

Common Machine Learning Algorithms

Algorithms are the set of steps that are passed into a model for processing. The best algorithm choice depends on your question and the type of data you are working with.

Overfitting

Overfitting is the problem of fitting a model too well to the training set. This could result in the model having poor predictive power when applied to a new dataset. During model training, you want to meet a balance between fitting a model with good accuracy (i.e., one that reduces error), without trying to account for so much error in the training model that you overfit the model.

Steps of Machine Learning

  1. Data Splitting
  2. Select Algorithm
  3. Tuning
  4. Cross-Validation
  5. Model Training
  6. Model Testing

Load Packages

#install.packages("caret)
#install.packages("randomForest")
#install.packages("xgboost")
#install.packages("glmnet")
#install.packages("kernlab)
#install.packages("MLmetrics")
#install.packages("nnet")
#install.packages("ggridges")
#install.packages("plotly")
library(caret)
Loading required package: ggplot2
Loading required package: lattice
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ lubridate 1.9.5     ✔ tibble    3.3.1
✔ purrr     1.2.1     ✔ tidyr     1.3.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
✖ purrr::lift()   masks caret::lift()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'

The following object is masked from 'package:dplyr':

    combine

The following object is masked from 'package:ggplot2':

    margin
library(xgboost)
library(glmnet)
Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

Loaded glmnet 4.1-10
library(kernlab)

Attaching package: 'kernlab'

The following object is masked from 'package:purrr':

    cross

The following object is masked from 'package:ggplot2':

    alpha
library(MLmetrics)

Attaching package: 'MLmetrics'

The following objects are masked from 'package:caret':

    MAE, RMSE

The following object is masked from 'package:base':

    Recall
library(nnet)
library(ggridges)
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

Dataset

The dataset we will be using comes from kaggle. Kaggle is an online platform which houses several large downloadable datasets, hosts competitions for prediction problems, and provides other resources to learn about machine learning.

The dataset is collated from Spotify’s API using a python script to extract popular songs and their associated audio and descriptive features. Descriptive features of a song include information about the song such as the artist name, album name and release date. Audio features include key, valence, danceability, and energy which are results of spotify’s audio analysis.

Import data

spotify_data <- read_csv("https://www.dropbox.com/scl/fi/nom62jv3vj5273jea9sje/high_popularity_spotify_data.csv?rlkey=orm4gvkq34d56fpedlejnkzvq&st=acw659gk&dl=1")
Rows: 1686 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (15): playlist_genre, track_artist, track_href, uri, track_album_name, p...
dbl (14): energy, tempo, danceability, loudness, liveness, valence, time_sig...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Clean data

Let’s take a look at the columns.

str(spotify_data)
spc_tbl_ [1,686 × 29] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ energy                  : num [1:1686] 0.592 0.507 0.808 0.91 0.783 0.582 0.561 0.247 0.416 0.722 ...
 $ tempo                   : num [1:1686] 158 105 109 113 149 ...
 $ danceability            : num [1:1686] 0.521 0.747 0.554 0.67 0.777 0.7 0.669 0.467 0.492 0.769 ...
 $ playlist_genre          : chr [1:1686] "pop" "pop" "pop" "pop" ...
 $ loudness                : num [1:1686] -7.78 -10.17 -4.17 -4.07 -4.48 ...
 $ liveness                : num [1:1686] 0.122 0.117 0.159 0.304 0.355 0.0881 0.0954 0.17 0.203 0.111 ...
 $ valence                 : num [1:1686] 0.535 0.438 0.372 0.786 0.939 0.785 0.841 0.126 0.297 0.57 ...
 $ track_artist            : chr [1:1686] "Lady Gaga, Bruno Mars" "Billie Eilish" "Gracie Abrams" "Sabrina Carpenter" ...
 $ time_signature          : num [1:1686] 3 4 4 4 4 4 4 4 4 4 ...
 $ speechiness             : num [1:1686] 0.0304 0.0358 0.0368 0.0634 0.26 0.0356 0.0411 0.0431 0.0254 0.0507 ...
 $ track_popularity        : num [1:1686] 100 97 93 81 98 94 88 93 71 92 ...
 $ track_href              : chr [1:1686] "https://api.spotify.com/v1/tracks/2plbrEY59IikOBgBGLjaoe" "https://api.spotify.com/v1/tracks/6dOtVTDdiauQNBQEDOtlAB" "https://api.spotify.com/v1/tracks/7ne4VBA60CxGM75vw0EYad" "https://api.spotify.com/v1/tracks/1d7Ptw3qYcfpdLNL5REhtJ" ...
 $ uri                     : chr [1:1686] "spotify:track:2plbrEY59IikOBgBGLjaoe" "spotify:track:6dOtVTDdiauQNBQEDOtlAB" "spotify:track:7ne4VBA60CxGM75vw0EYad" "spotify:track:1d7Ptw3qYcfpdLNL5REhtJ" ...
 $ track_album_name        : chr [1:1686] "Die With A Smile" "HIT ME HARD AND SOFT" "The Secret of Us (Deluxe)" "Short n' Sweet" ...
 $ playlist_name           : chr [1:1686] "Today's Top Hits" "Today's Top Hits" "Today's Top Hits" "Today's Top Hits" ...
 $ analysis_url            : chr [1:1686] "https://api.spotify.com/v1/audio-analysis/2plbrEY59IikOBgBGLjaoe" "https://api.spotify.com/v1/audio-analysis/6dOtVTDdiauQNBQEDOtlAB" "https://api.spotify.com/v1/audio-analysis/7ne4VBA60CxGM75vw0EYad" "https://api.spotify.com/v1/audio-analysis/1d7Ptw3qYcfpdLNL5REhtJ" ...
 $ track_id                : chr [1:1686] "2plbrEY59IikOBgBGLjaoe" "6dOtVTDdiauQNBQEDOtlAB" "7ne4VBA60CxGM75vw0EYad" "1d7Ptw3qYcfpdLNL5REhtJ" ...
 $ track_name              : chr [1:1686] "Die With A Smile" "BIRDS OF A FEATHER" "That’s So True" "Taste" ...
 $ track_album_release_date: chr [1:1686] "2024-08-16" "2024-05-17" "2024-10-18" "2024-08-23" ...
 $ instrumentalness        : num [1:1686] 0.00 6.08e-02 0.00 0.00 0.00 0.00 9.62e-03 2.71e-04 8.61e-05 2.56e-06 ...
 $ track_album_id          : chr [1:1686] "10FLjwfpbxLmW8c25Xyc2N" "7aJuG4TFXa2hmE4z1yxc3n" "0hBRqPYPXhr1RkTDG3n4Mk" "4B4Elma4nNDUyl6D5PvQkj" ...
 $ mode                    : num [1:1686] 0 1 1 0 0 0 1 0 1 0 ...
 $ key                     : num [1:1686] 6 2 1 0 0 11 10 6 11 11 ...
 $ duration_ms             : num [1:1686] 251668 210373 166300 157280 169917 ...
 $ acousticness            : num [1:1686] 0.308 0.2 0.214 0.0939 0.0283 0.0502 0.495 0.612 0.686 0.0584 ...
 $ id                      : chr [1:1686] "2plbrEY59IikOBgBGLjaoe" "6dOtVTDdiauQNBQEDOtlAB" "7ne4VBA60CxGM75vw0EYad" "1d7Ptw3qYcfpdLNL5REhtJ" ...
 $ playlist_subgenre       : chr [1:1686] "mainstream" "mainstream" "mainstream" "mainstream" ...
 $ type                    : chr [1:1686] "audio_features" "audio_features" "audio_features" "audio_features" ...
 $ playlist_id             : chr [1:1686] "37i9dQZF1DXcBWIGoYBM5M" "37i9dQZF1DXcBWIGoYBM5M" "37i9dQZF1DXcBWIGoYBM5M" "37i9dQZF1DXcBWIGoYBM5M" ...
 - attr(*, "spec")=
  .. cols(
  ..   energy = col_double(),
  ..   tempo = col_double(),
  ..   danceability = col_double(),
  ..   playlist_genre = col_character(),
  ..   loudness = col_double(),
  ..   liveness = col_double(),
  ..   valence = col_double(),
  ..   track_artist = col_character(),
  ..   time_signature = col_double(),
  ..   speechiness = col_double(),
  ..   track_popularity = col_double(),
  ..   track_href = col_character(),
  ..   uri = col_character(),
  ..   track_album_name = col_character(),
  ..   playlist_name = col_character(),
  ..   analysis_url = col_character(),
  ..   track_id = col_character(),
  ..   track_name = col_character(),
  ..   track_album_release_date = col_character(),
  ..   instrumentalness = col_double(),
  ..   track_album_id = col_character(),
  ..   mode = col_double(),
  ..   key = col_double(),
  ..   duration_ms = col_double(),
  ..   acousticness = col_double(),
  ..   id = col_character(),
  ..   playlist_subgenre = col_character(),
  ..   type = col_character(),
  ..   playlist_id = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
  • Energy, tempo, danceability, loudness, liveness, valence, time signature, speechiness, track popularity, instrumentalness, mode, key, duration, and acousticness are all numeric columns.

  • Playlist genre, track artist, album name, playlist name, album release date, and playlist subgenre are all character columns.

Most of these classifications make sense, but there are a few we should revise before we move on.

#change time signature to a factor variable 
spotify_data$time_signature <- as.factor(spotify_data$time_signature)
#change playlist genre to a factor 
spotify_data$playlist_genre <- as.factor(spotify_data$playlist_genre)
#change mode to a factor variable 
spotify_data$mode <- as.factor(spotify_data$mode)
#change key to a factor variable 
spotify_data$key <- as.factor(spotify_data$key)
#change track artist to just the first artist in list
spotify_data$track_artist <- sub(",.*", "", spotify_data$track_artist)
#change album release date to two columns: year & month 
spotify_data <- spotify_data %>%
  mutate(
    release_year = as.numeric(substr(track_album_release_date, 1, 4)),
    release_month = as.numeric(substr(track_album_release_date, 6, 7))
  )

spotify_data <- spotify_data %>%
  select(-track_album_release_date)

#clean out NAs
spotify_data <- na.omit(spotify_data)

#make valid r names for variables 
levels(spotify_data$playlist_genre) <- make.names(
  levels(spotify_data$playlist_genre)
)

spotify_data$playlist_genre <- factor(make.names(spotify_data$playlist_genre))

There are also some values I know I won’t use, so I’m going to go ahead & clean those out now.

spotify_data_clean <- spotify_data %>%
  select(-track_href, -uri, -analysis_url, -track_album_id, -id, -type, -playlist_id, -track_id)

Regression

For an example of regression, we will be predicting the popularity (0-100) of a track based on the audio and descriptive features of that track.

We will work with three different examples of algorithms: linear regression, random forest, and elastic net.

Data Splitting

Machine learning requires that you build your model using a separate dataset than the one used to test the accuracy of your model. Data splitting is used to split a single dataset into a training set and a testing set. You typically want more data in the training set since this data is being used to build your model than in the testing set. Some proportions that are used typically are 70% training & 30% testing. You can either split the data randomly or stratified across important characteristics if necessary. More data in the training set will generally lead to a better model, but more data in the testing set will make the application of your model more generalizeable.

createDataPartition is used to split the original data into a training set and a testing set. The inputs into createDataPartition include:

  • y = the outcome variable

  • times = the number of times you want to split the data

  • p = the percentage of the data that goes into the training set

  • list = FALSE gives the results in a matrix with the row numbers of the partition that you can pass back into the training data to effectively split it

#make the split reproducible
set.seed(500)

#split the spotify data into a training & testing set 
split_data <- createDataPartition(
  y = spotify_data_clean$track_popularity, 
  times = 1, 
  p = .7, 
  list = FALSE)

training_set <- spotify_data_clean[split_data, ]
testing_set <- spotify_data_clean[-split_data, ] 

#clean out predictors with too many unique values 
training_set <- training_set %>%
  select(-track_name, -track_album_name, -track_artist, -playlist_name, -playlist_subgenre)

testing_set <- testing_set %>%
  select(-track_name, -track_album_name, -track_artist, -playlist_name, -playlist_subgenre)

Algorithm 1: Linear Regression

A linear regression algorithm is typically used when the relationships are mostly linear, independent, and the dataset is small. Usually it is used as the first type of model you might try.

Cross-Validation

Earlier we talked about splitting our our dataset into training and testing sets in order to build a model that generalizess well to new incoming data. However, this process of training and testing is less than ideal. When we eventually test our data against the remaining data in our test set we are only seeing what the error is for that exact grouping of the test data. Using a single testing set to evaluate the accuracy of one’s models can have limitations in practice because the testing set may not be representative of the dataset as a whole.

Cross-validation is a statistical technique for splitting the training set multiple times into training/testing sets. Each of these training/testing sets is evaluated for error, and the error across all of the sets is averaged. This provides a more accurate assessment of the model’s performance.

There are various cross-validation techniques available in R, but the ones we will cover are k-fold and leave-one-one cross-validation:

  • k-fold cross-validation: randomly splits the dataset into k chunks (aka, folds) of roughly equal size, and these chunks are split into training/testing sets. The error across all chunks is averaged. k can be any number between 2 and the number of observations in the full dataset, but it is most commonly a value between 3 and 10.

  • leave-one-out cross-validation: the case where k=n; this results in the training sets containing n-1 observations each, and each test set containing a single observation

#specify which cross-validation method we're using, we'll start with leave one out 
train_cv_leave <- trainControl(method = "LOOCV") 

Training

The train function is used for model training. It uses the following inputs:

  • y = the outcome variable; y ~. means predict y from all other variables in the dataset

  • method = the machine learning algorithm

  • trControl = the cross-validation method

  • tuneGrid = a data frame of the hyperparameters that you want to be evaluated during model training

  • preProc = any pre-processing adjustments that you want done on the predictor data

linear_model <- train(track_popularity ~., 
                      data = training_set, 
                      method = "lm", 
                      trControl = train_cv_leave) 
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
linear_model$results
  intercept     RMSE  Rsquared      MAE
1      TRUE 5.296372 0.2417261 4.290571

Interpreting Results

  • RMSE (Root Mean Squared Error): difference between values predicted and actual observed values

  • Rsquared (Coefficient of determination): proportion of variance explained

  • MAE (Mean absolute error): Average magnitude of error

Testing

linear_predicted <- predict(linear_model, testing_set)

linear_results <- postResample(linear_predicted, testing_set$track_popularity)

results <- data.frame(
  actual = testing_set$track_popularity,
  predicted = linear_predicted
)

ggplot(results, aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_classic() +
  xlim(63,100) +
  ylim(63,100) +
  labs(
    x = "Actual Popularity",
    y = "Predicted Popularity",
    title = "Actual vs Predicted Track Popularity"
  ) 
`geom_smooth()` using formula = 'y ~ x'

Algorithm 2: Random Forest

A random forest algorithm creates a collection of decision trees to make predictions. It repeatedly samples subsets of the training data and randomly selects a subset of predictors to consider at each split in the trees. In regression, the final prediction is the average prediction across all trees. Random forests are typically used for data with nonlinear relationships and interactions.

Cross-Validation

#this time let's use the k-fold cross-validation method, k = 5
train_cv_kfold <- trainControl(method = "cv", number = 5)

Tuning

In machine learning, we work with something we call hyperparameters. Hyperparameters are values that specify the settings of an algorithm that can be “tuned” by the researcher prior to training a model. Different algorithms have different hyperparameters. You have to rely on rules of thumb and/or try to find the best values through trial and error.

The linear model did not have any hyperparameters to tune, but random forest does. Random forest models have the mtry parameter, which means the number of predictors chosen randomly at each split.

#view hyperparameters
rf.info <- getModelInfo("rf")
rf.info$rf$parameters
  parameter   class                         label
1      mtry numeric #Randomly Selected Predictors
#create a grid of possible values for mtry 
tune.grid <- expand.grid(mtry = 4:8)

Training

rf_model <- train(track_popularity ~. , 
                  data = training_set, 
                  method = "rf", 
                  trControl = train_cv_kfold, 
                  tuneGrid = tune.grid)

rf_model
Random Forest 

1126 samples
  16 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 902, 901, 900, 901, 900 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   MAE     
  4     4.977366  0.3949759  4.072415
  5     4.904692  0.4003765  3.998827
  6     4.864938  0.4017046  3.941886
  7     4.843565  0.4003611  3.922037
  8     4.841753  0.3957096  3.913895

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 8.
rf_model$results
  mtry     RMSE  Rsquared      MAE     RMSESD RsquaredSD      MAESD
1    4 4.977366 0.3949759 4.072415 0.06533924  0.1086559 0.08594152
2    5 4.904692 0.4003765 3.998827 0.07805872  0.1050139 0.10237789
3    6 4.864938 0.4017046 3.941886 0.08388209  0.1042593 0.11014090
4    7 4.843565 0.4003611 3.922037 0.12131274  0.1089005 0.15311219
5    8 4.841753 0.3957096 3.913895 0.12187678  0.1098436 0.14242654

Testing

rf_predicted <- predict(rf_model, testing_set)

rf_results <- postResample(rf_predicted, testing_set$track_popularity) 

results <- data.frame(
  actual = testing_set$track_popularity,
  predicted = rf_predicted
)

ggplot(results, aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_classic() +
  xlim(63,100) +
  ylim(63,100) +
  labs(
    x = "Actual Popularity",
    y = "Predicted Popularity",
    title = "Actual vs Predicted Track Popularity"
  ) 
`geom_smooth()` using formula = 'y ~ x'

Algorithm 3: Elastic Net

An elastic net algorithm is linear regression that combines penalties from Lasso and Ridge regularization. It is particularly useful when many of the predictors are highly correlated.

Tuning

#view hyperparameters
glmnet.info <- getModelInfo("glmnet")
glmnet.info$glmnet$parameters
  parameter   class                    label
1     alpha numeric        Mixing Percentage
2    lambda numeric Regularization Parameter
#create a grid of possible values for mtry 
tune.grid <- expand.grid(alpha = c(0, .1, 1), lambda = c(.1, .5, 1))

Training

glmnet_model <- train(
  track_popularity ~ .,
  data = training_set,
  method = "glmnet",
  trControl = train_cv_kfold,
  tuneGrid = tune.grid
)

glmnet_model
glmnet 

1126 samples
  16 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 901, 901, 901, 901, 900 
Resampling results across tuning parameters:

  alpha  lambda  RMSE      Rsquared   MAE     
  0.0    0.1     5.285463  0.2428573  4.295552
  0.0    0.5     5.283446  0.2420817  4.295542
  0.0    1.0     5.283703  0.2417406  4.297958
  0.1    0.1     5.277191  0.2453683  4.288698
  0.1    0.5     5.259251  0.2489634  4.279334
  0.1    1.0     5.282091  0.2479294  4.302272
  1.0    0.1     5.243624  0.2532304  4.265397
  1.0    0.5     5.440262  0.2133681  4.432194
  1.0    1.0     5.637656  0.1836600  4.570915

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 0.1.
glmnet_model$results
  alpha lambda     RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
1   0.0    0.1 5.285463 0.2428573 4.295552 0.1989030 0.03907924 0.1361959
2   0.0    0.5 5.283446 0.2420817 4.295542 0.1996800 0.03744801 0.1346376
3   0.0    1.0 5.283703 0.2417406 4.297958 0.1986762 0.03516688 0.1312690
4   0.1    0.1 5.277191 0.2453683 4.288698 0.1957427 0.03984448 0.1347670
5   0.1    0.5 5.259251 0.2489634 4.279334 0.1957828 0.03446087 0.1268399
6   0.1    1.0 5.282091 0.2479294 4.302272 0.1947200 0.02824197 0.1209968
7   1.0    0.1 5.243624 0.2532304 4.265397 0.1935063 0.03251421 0.1203091
8   1.0    0.5 5.440262 0.2133681 4.432194 0.1819225 0.01989777 0.1331396
9   1.0    1.0 5.637656 0.1836600 4.570915 0.2075898 0.01519255 0.1401612

Testing

glmnet_predicted <- predict(glmnet_model, testing_set)

glmnet_results <- postResample(glmnet_predicted, testing_set$track_popularity) 

results <- data.frame(
  actual = testing_set$track_popularity,
  predicted = glmnet_predicted
)

ggplot(results, aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_classic() +
  xlim(63,100) +
  ylim(63,100) +
  labs(
    x = "Actual Popularity",
    y = "Predicted Popularity",
    title = "Actual vs Predicted Track Popularity"
  ) 
`geom_smooth()` using formula = 'y ~ x'

Comparing Our Models

comparison_stats <- rbind(linear_results, rf_results, glmnet_results)

comparison_stats
                   RMSE  Rsquared      MAE
linear_results 5.358648 0.2367604 4.313837
rf_results     4.595594 0.4562361 3.777065
glmnet_results 5.282273 0.2483996 4.230569

Looks like random forest does the best by far!

Classification

For an example of classification, we will be predicting the genre of a particular track based on the audio and descriptive features of that track.

We will work with three different examples of algorithms: support vector machine, random forest, and multinomial logistic regression.

Pre-Processing

There are a few genres in our data that describe a small amount of tracks. With limited data for these genres, it will be hard to predict, so we’ll want to remove these less common genres.

#identify rare genres
genre_counts <- table(spotify_data_clean$playlist_genre)
genre_counts

 afrobeats    ambient     arabic      blues  brazilian  classical    country 
        20         61         50         41         14         10          3 
electronic       folk     gaming    hip.hop     indian      indie      j.pop 
       145         32        100        227          9          4         11 
      jazz      k.pop     korean      latin       lofi      metal        pop 
         1         10          8        184          2         30        331 
      punk        r.b     reggae       rock       soul    turkish      world 
        46         48          5        200          2          7          4 
rare_genres <- names(genre_counts[genre_counts < 100])

#remove rare genres from dataset 
spotify_data_class <- spotify_data_clean %>%
  filter(!(playlist_genre %in% rare_genres))

#re do levels for factor variables 
spotify_data_class$playlist_genre <- factor(spotify_data_class$playlist_genre)

Data Splitting

#make the split reproducible
set.seed(433)

#split the spotify data into a training & testing set 
split_data <- createDataPartition(
  y = spotify_data_class$track_popularity, 
  times = 1, 
  p = .7, 
  list = FALSE)

training_set <- spotify_data_class[split_data, ]
testing_set <- spotify_data_class[-split_data, ] 

#clean out predictors with too many unique values 
training_set <- training_set %>%
  select(-track_name, -track_album_name, -track_artist, -playlist_name, -playlist_subgenre)

testing_set <- testing_set %>%
  select(-track_name, -track_album_name, -track_artist, -playlist_name, -playlist_subgenre)

Algorithm 1: Support Vector Machine (SVM)

Support vector machine works by finding the boundary that separates different groups of data. The boundary can be linear or non-linear, but for today’s example we will use a linear svm.

Cross-Validation

#let's use the k-fold cross-validation method, k = 3
train_cv_kfold <- trainControl(method = "cv", number = 2, classProbs = TRUE, summaryFunction = multiClassSummary, savePredictions = TRUE)

Tuning

#view hyperparameters
svmL.info <- getModelInfo("svmLinear")
svmL.info$svmLinear$parameters 
  parameter   class label
1         C numeric  Cost
#create a grid of possible values for cost
tune.grid <- expand.grid(C = c(.01, .1, 1))

Training

svm_model <- train(
  playlist_genre ~ .,
  data = training_set,
  method = "svmLinear",
  trControl = train_cv_kfold,
  tuneGrid = tune.grid,
  preProcess = c("center", "scale")
)
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
svm_model
Support Vector Machines with Linear Kernel 

833 samples
 16 predictor
  6 classes: 'electronic', 'gaming', 'hip.hop', 'latin', 'pop', 'rock' 

Pre-processing: centered (28), scaled (28) 
Resampling: Cross-Validated (2 fold) 
Summary of sample sizes: 416, 417 
Resampling results across tuning parameters:

  C     logLoss   AUC        prAUC      Accuracy   Kappa      Mean_F1  
  0.01  1.371689  0.7866954  0.4324617  0.4441610  0.3129483  0.3784368
  0.10  1.375277  0.7778194  0.4257005  0.4561802  0.3238496  0.3981869
  1.00  1.386109  0.7841245  0.4318401  0.4477754  0.3154999  0.3980684
  Mean_Sensitivity  Mean_Specificity  Mean_Pos_Pred_Value  Mean_Neg_Pred_Value
  0.4218234         0.8850772         0.3689875            0.8879689          
  0.4240539         0.8865896         0.4247054            0.8895532          
  0.4198306         0.8853290         0.4059339            0.8883232          
  Mean_Precision  Mean_Recall  Mean_Detection_Rate  Mean_Balanced_Accuracy
  0.3689875       0.4218234    0.07402684           0.6534503             
  0.4247054       0.4240539    0.07603004           0.6553217             
  0.4059339       0.4198306    0.07462924           0.6525798             

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was C = 0.1.

Testing

svm_predicted <- predict(svm_model, testing_set) 
svm_results <- postResample(svm_predicted, testing_set$playlist_genre) 

Interpreting Results

For classification algorithms, the following are indicators of performance:

  • % Accuracy

  • Kappa

  • Sensitivity = correctly identify instances

  • Specificity = correctly identify non-instances

  • Confusion Matrix

svm_results
 Accuracy     Kappa 
0.4435028 0.3228992 
confusionMatrix(svm_predicted, testing_set$playlist_genre)
Confusion Matrix and Statistics

            Reference
Prediction   electronic gaming hip.hop latin pop rock
  electronic         20      9       9     8   8    3
  gaming              3      7       3     3  19    0
  hip.hop             6     12      41    21   7    0
  latin               1      0       0     1   1    0
  pop                 8      5       3    22  36    2
  rock                6      0       1     2  35   52

Overall Statistics
                                        
               Accuracy : 0.4435        
                 95% CI : (0.391, 0.497)
    No Information Rate : 0.2994        
    P-Value [Acc > NIR] : 7.387e-09     
                                        
                  Kappa : 0.3229        
                                        
 Mcnemar's Test P-Value : NA            

Statistics by Class:

                     Class: electronic Class: gaming Class: hip.hop
Sensitivity                     0.4545       0.21212         0.7193
Specificity                     0.8806       0.91277         0.8451
Pos Pred Value                  0.3509       0.20000         0.4713
Neg Pred Value                  0.9192       0.91850         0.9401
Prevalence                      0.1243       0.09322         0.1610
Detection Rate                  0.0565       0.01977         0.1158
Detection Prevalence            0.1610       0.09887         0.2458
Balanced Accuracy               0.6676       0.56245         0.7822
                     Class: latin Class: pop Class: rock
Sensitivity              0.017544     0.3396      0.9123
Specificity              0.993266     0.8387      0.8519
Pos Pred Value           0.333333     0.4737      0.5417
Neg Pred Value           0.840456     0.7482      0.9806
Prevalence               0.161017     0.2994      0.1610
Detection Rate           0.002825     0.1017      0.1469
Detection Prevalence     0.008475     0.2147      0.2712
Balanced Accuracy        0.505405     0.5892      0.8821

Algorithm 2: Random Forest

Tuning

#view hyperparameters
rf.info <- getModelInfo("rf")
rf.info$rf$parameters
  parameter   class                         label
1      mtry numeric #Randomly Selected Predictors
#create a grid of possible values for mtry 
tune.grid <- expand.grid(mtry = 4:8)

Training

rf_model <- train(playlist_genre ~. , 
                  data = training_set, 
                  method = "rf", 
                  trControl = train_cv_kfold, 
                  tuneGrid = tune.grid,
                  preProc = c("center"))

rf_model
Random Forest 

833 samples
 16 predictor
  6 classes: 'electronic', 'gaming', 'hip.hop', 'latin', 'pop', 'rock' 

Pre-processing: centered (28) 
Resampling: Cross-Validated (2 fold) 
Summary of sample sizes: 417, 416 
Resampling results across tuning parameters:

  mtry  logLoss   AUC        prAUC      Accuracy   Kappa      Mean_F1  
  4     1.185627  0.8576004  0.5295191  0.5918361  0.4861643  0.5234288
  5     1.162771  0.8616323  0.5355522  0.6110266  0.5114216  0.5425920
  6     1.151292  0.8625677  0.5347833  0.6050429  0.5035903  0.5369395
  7     1.149072  0.8601723  0.5284776  0.6110294  0.5124704  0.5450861
  8     1.131435  0.8635862  0.5318941  0.6038381  0.5035724  0.5410953
  Mean_Sensitivity  Mean_Specificity  Mean_Pos_Pred_Value  Mean_Neg_Pred_Value
  0.5199637         0.9136537         0.5613423            0.9171028          
  0.5407360         0.9179144         0.5730717            0.9210681          
  0.5341576         0.9165895         0.5665716            0.9196487          
  0.5435135         0.9182603         0.5687050            0.9208975          
  0.5378259         0.9167281         0.5691153            0.9191388          
  Mean_Precision  Mean_Recall  Mean_Detection_Rate  Mean_Balanced_Accuracy
  0.5613423       0.5199637    0.09863936           0.7168087             
  0.5730717       0.5407360    0.10183776           0.7293252             
  0.5665716       0.5341576    0.10084048           0.7253735             
  0.5687050       0.5435135    0.10183824           0.7308869             
  0.5691153       0.5378259    0.10063968           0.7272770             

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 7.

Testing

rf_predicted <- predict(rf_model, testing_set)

rf_results <- postResample(rf_predicted, testing_set$playlist_genre) 

rf_results
 Accuracy     Kappa 
0.5706215 0.4615646 
confusionMatrix(rf_predicted, testing_set$playlist_genre)
Confusion Matrix and Statistics

            Reference
Prediction   electronic gaming hip.hop latin pop rock
  electronic         14      5       2     0   8    1
  gaming              2      4       2     1   6    1
  hip.hop             5      8      39    15   5    0
  latin               2      5       5    34   7    1
  pop                20     11       9     7  66    9
  rock                1      0       0     0  14   45

Overall Statistics
                                          
               Accuracy : 0.5706          
                 95% CI : (0.5172, 0.6228)
    No Information Rate : 0.2994          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.4616          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: electronic Class: gaming Class: hip.hop
Sensitivity                    0.31818       0.12121         0.6842
Specificity                    0.94839       0.96262         0.8889
Pos Pred Value                 0.46667       0.25000         0.5417
Neg Pred Value                 0.90741       0.91420         0.9362
Prevalence                     0.12429       0.09322         0.1610
Detection Rate                 0.03955       0.01130         0.1102
Detection Prevalence           0.08475       0.04520         0.2034
Balanced Accuracy              0.63328       0.54191         0.7865
                     Class: latin Class: pop Class: rock
Sensitivity               0.59649     0.6226      0.7895
Specificity               0.93266     0.7742      0.9495
Pos Pred Value            0.62963     0.5410      0.7500
Neg Pred Value            0.92333     0.8276      0.9592
Prevalence                0.16102     0.2994      0.1610
Detection Rate            0.09605     0.1864      0.1271
Detection Prevalence      0.15254     0.3446      0.1695
Balanced Accuracy         0.76458     0.6984      0.8695

Algorithm 3: Multinomial Logistic Regression

Multinomial logistic regression can be used when the outcome variable is categorical and is more than two categories. It works well for unordered outcome variables and is easier to interpret than other algorithms.

Tuning

#view hyperparameters
mn_info <- getModelInfo("multinom")
mn_info$multinom$parameters
  parameter   class        label
1     decay numeric Weight Decay
#create a grid of possible values for decay
tune.grid <- expand.grid(decay = c(0, .1, 1))

Training

mn_model <- train(playlist_genre ~. , 
                  data = training_set, 
                  method = "multinom", 
                  trControl = train_cv_kfold, 
                  tuneGrid = tune.grid,
                  preProc = c("center", "scale"))
# weights:  180 (145 variable)
initial  value 747.163699 
iter  10 value 420.225246
iter  20 value 381.322745
iter  30 value 364.501676
iter  40 value 359.570515
iter  50 value 358.299096
iter  60 value 358.116326
iter  70 value 358.075127
iter  80 value 358.062465
iter  90 value 358.054325
iter 100 value 358.051411
final  value 358.051411 
stopped after 100 iterations
# weights:  180 (145 variable)
initial  value 747.163699 
iter  10 value 422.367854
iter  20 value 386.337410
iter  30 value 371.360767
iter  40 value 368.262388
iter  50 value 367.619229
iter  60 value 367.564703
iter  70 value 367.556201
iter  80 value 367.555296
iter  90 value 367.555194
iter  90 value 367.555190
iter  90 value 367.555190
final  value 367.555190 
converged
# weights:  180 (145 variable)
initial  value 747.163699 
iter  10 value 437.687690
iter  20 value 412.965570
iter  30 value 407.696436
iter  40 value 407.183164
iter  50 value 407.137004
iter  60 value 407.134076
final  value 407.134026 
converged
# weights:  180 (145 variable)
initial  value 745.371939 
iter  10 value 414.945333
iter  20 value 376.306877
iter  30 value 356.708913
iter  40 value 351.935670
iter  50 value 350.693568
iter  60 value 350.522278
iter  70 value 350.492181
iter  80 value 350.483368
iter  90 value 350.477701
iter 100 value 350.473990
final  value 350.473990 
stopped after 100 iterations
# weights:  180 (145 variable)
initial  value 745.371939 
iter  10 value 416.562231
iter  20 value 380.750132
iter  30 value 363.914362
iter  40 value 360.852452
iter  50 value 360.252084
iter  60 value 360.205828
iter  70 value 360.194303
iter  80 value 360.192187
iter  90 value 360.191892
iter 100 value 360.191828
final  value 360.191828 
stopped after 100 iterations
# weights:  180 (145 variable)
initial  value 745.371939 
iter  10 value 429.456149
iter  20 value 407.828144
iter  30 value 401.213703
iter  40 value 400.763399
iter  50 value 400.707168
iter  60 value 400.704210
final  value 400.704125 
converged
# weights:  180 (145 variable)
initial  value 1492.535638 
iter  10 value 863.665126
iter  20 value 848.647817
iter  30 value 842.940704
iter  40 value 839.847013
iter  50 value 838.290141
iter  60 value 837.730500
iter  70 value 837.505405
iter  80 value 837.449287
iter  90 value 837.439699
iter 100 value 837.438718
final  value 837.438718 
stopped after 100 iterations
mn_model
Penalized Multinomial Regression 

833 samples
 16 predictor
  6 classes: 'electronic', 'gaming', 'hip.hop', 'latin', 'pop', 'rock' 

Pre-processing: centered (28), scaled (28) 
Resampling: Cross-Validated (2 fold) 
Summary of sample sizes: 417, 416 
Resampling results across tuning parameters:

  decay  logLoss   AUC        prAUC      Accuracy   Kappa      Mean_F1  
  0.0    1.466745  0.8304200  0.4992322  0.5234072  0.4124987  0.4901317
  0.1    1.326127  0.8339954  0.5029721  0.5282063  0.4169487  0.4899681
  1.0    1.236696  0.8332737  0.5043038  0.5426236  0.4305708  0.4939988
  Mean_Sensitivity  Mean_Specificity  Mean_Pos_Pred_Value  Mean_Neg_Pred_Value
  0.4905827         0.9023908         0.4947759            0.9025364          
  0.4903900         0.9030643         0.4954029            0.9035171          
  0.4912196         0.9049553         0.5125618            0.9062978          
  Mean_Precision  Mean_Recall  Mean_Detection_Rate  Mean_Balanced_Accuracy
  0.4947759       0.4905827    0.08723454           0.6964868             
  0.4954029       0.4903900    0.08803438           0.6967272             
  0.5125618       0.4912196    0.09043727           0.6980875             

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was decay = 1.

Testing

mn_predicted <- predict(mn_model, testing_set)

mn_results <- postResample(mn_predicted, testing_set$playlist_genre) 

mn_results
 Accuracy     Kappa 
0.5423729 0.4279958 
confusionMatrix(mn_predicted, testing_set$playlist_genre)
Confusion Matrix and Statistics

            Reference
Prediction   electronic gaming hip.hop latin pop rock
  electronic         16      3       3     3   6    3
  gaming              1      5       1     2   2    0
  hip.hop             6      5      36    15   6    1
  latin               5      6       9    28   6    1
  pop                14     14       6     7  64    9
  rock                2      0       2     2  22   43

Overall Statistics
                                          
               Accuracy : 0.5424          
                 95% CI : (0.4889, 0.5951)
    No Information Rate : 0.2994          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.428           
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: electronic Class: gaming Class: hip.hop
Sensitivity                    0.36364       0.15152         0.6316
Specificity                    0.94194       0.98131         0.8889
Pos Pred Value                 0.47059       0.45455         0.5217
Neg Pred Value                 0.91250       0.91837         0.9263
Prevalence                     0.12429       0.09322         0.1610
Detection Rate                 0.04520       0.01412         0.1017
Detection Prevalence           0.09605       0.03107         0.1949
Balanced Accuracy              0.65279       0.56641         0.7602
                     Class: latin Class: pop Class: rock
Sensitivity                0.4912     0.6038      0.7544
Specificity                0.9091     0.7984      0.9057
Pos Pred Value             0.5091     0.5614      0.6056
Neg Pred Value             0.9030     0.8250      0.9505
Prevalence                 0.1610     0.2994      0.1610
Detection Rate             0.0791     0.1808      0.1215
Detection Prevalence       0.1554     0.3220      0.2006
Balanced Accuracy          0.7002     0.7011      0.8301

Comparing Our Models

comparison_stats <- rbind(svm_results, rf_results, mn_results)

comparison_stats
             Accuracy     Kappa
svm_results 0.4435028 0.3228992
rf_results  0.5706215 0.4615646
mn_results  0.5423729 0.4279958

Random forest did it again!

Clustering

For an example of a clustering algorithm, we will use k-means clustering. In k-means, the researcher defines the cluster number and clusters are defined by estimating centroids.

Clean data

For clustering, remove non-numeric values and scale the variables.

spotify_cluster <- spotify_data_clean %>%
  select(where(is.numeric))

spotify_cluster <- scale(spotify_cluster)

Choose number of clusters

The “Elbow Method” is a common method for visually determining how many clusters there may be in your data. For each k in a k-means algorithm, the sum of squared distances between each data point is calculated. The “elbow” or bend in the plot identifies where the rate of decrease drops and levels off.

set.seed(471)

wss <- map_dbl(1:10, function(k) {
  kmeans(spotify_cluster, centers = k, nstart = 25)$tot.withinss
})
Warning: did not converge in 10 iterations
Warning: did not converge in 10 iterations
elbow_df <- data.frame(
  k = 1:10,
  wss = wss
)

ggplot(elbow_df, aes(x = k, y = wss)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = 1:10) +
  labs(
    x = "Number of clusters",
    y = "Total within-cluster sum of squares",
    title = "Elbow Plot for K-Means"
  )

Run k-means

set.seed(471)

kmeans_model <- kmeans(
  spotify_cluster,
  centers = 5,
  nstart = 25 #run 25 random starting points
)

Compare cluster information to track characteristics

#add cluster data into dataframe 
spotify_with_clusters <- spotify_data_clean %>%
  mutate(cluster = factor(kmeans_model$cluster))

#compare with genre 
table(spotify_with_clusters$cluster, spotify_with_clusters$playlist_genre)
   
    afrobeats ambient arabic blues brazilian classical country electronic folk
  1         5      43      3    13         0         0       1          5   24
  2         9       8     20     6         7         0       0         99    5
  3         0       4      0     1         0        10       0         18    1
  4         6       6     27     2         7         0       0         13    0
  5         0       0      0    19         0         0       2         10    2
   
    gaming hip.hop indian indie j.pop jazz k.pop korean latin lofi metal pop
  1      3      18      5     0     0    1     0      4    16    2     0  58
  2     69      72      4     3     9    0     9      2   122    0     0 195
  3     11       2      0     1     0    0     0      0     1    0     0   5
  4     17     123      0     0     2    0     1      0    42    0     0  17
  5      0      12      0     0     0    0     0      2     3    0    30  56
   
    punk r.b reggae rock soul turkish world
  1    0  14      0   14    1       4     1
  2   10  17      0   29    1       3     2
  3    0   0      0    4    0       0     0
  4    0   8      3    2    0       0     1
  5   36   9      2  151    0       0     0
#create plot to visualize energy, loudness, and valence for clusters 
plot_ly(
  data = spotify_with_clusters,
  x = ~valence,
  y = ~danceability,
  z = ~energy,
  color = ~cluster,
  type = "scatter3d",
  mode = "markers",
  marker = list(size = 3)
)
#release year density distribution 
spotify_with_clusters %>%
  ggplot(aes(
    x = release_year,
    y = cluster,
    fill = cluster
  )) +
  geom_density_ridges() +
  theme_classic() +
  labs(
    x = "Release Year",
    y = "Cluster",
    title = "Distribution of Release Years by Cluster"
  )
Picking joint bandwidth of 2.1

Resources

Caret Package Documentation

StatQuest with Josh Starmer

Tidy Modeling with R

Mini Hacks

Mini Hack #1: Predicting the Groove

Using the spotify_data_clean dataset, please split your data using an 80-20 split. Use a regression algorithm of your choice to predict the danceability of a particular track. Please keep the seed set to 100 for reproducibility.

set.seed(100)

Mini Hack #2: Minor Setbacks, Major Predictions

Using the spotify_data_clean dataset, choose a classification algorithm of your choice to predict whether a track is major :) or minor :(. In the dataset, 0 = minor and 1 = major. Please keep the seed set to 50 for reproducibility.

set.seed(50)

Mini Hack #3:

For this mini-hack, we are going to have a modeling competition! You will be provided with two data sets, one for regression and one for classification. Your job will be to tune the best model possible. You can use any algorithms you’d like, but I would encourage you to try a few. They can be fancy or simple, and you can do whatever you want to the data, as long as Rob can reproduce it (so make sure to use set.seed() where appropriate, like data splitting). Your model will be tested against held out data that Rob has stored separately from the ones shared here.

The winner will receive a prize (of little value) and bragging rights!

regression

For this, I want you to build a model to best predict the variable burgl_per_pop which is the amount of burglaries per capita.

# load data
crime <- read_csv('https://www.dropbox.com/scl/fi/6govdl6cca3ntqsrk4a04/crime_training.csv?rlkey=ipoiuqqyyjhiww75436jgidbj&dl=1')
Rows: 1333 Columns: 114
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (2): communityname, state
dbl (112): pop, per_housh, pct_black, pct_white, pct_asian, pct_hisp, pct12_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

classification

This data set contains variables related to participants psychological state. I want you to build a classification model to best classify the variable psychological_state which is the the emotional state each person reported being in.

# load data
psych_state <- read_csv('https://www.dropbox.com/scl/fi/4hplpyu1ln58gvb9fg5li/psych_state_training.csv?rlkey=x0tzgp9bf39nbiq9dx1fb8s23&dl=1')
Rows: 802 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): eeg_power_bands, blood_pressure_mm_hg, cognitive_load, mood_state, ...
dbl (9): hrv_ms, gsr_m_s, oxygen_saturation_percent, heart_rate_bpm, ambient...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.