1 Introduction

This project combines exploratory analysis and modeling strategies on a dataset of song information to develop a regression model that predicts song popularity from the most impactful variables. The steps used to find the best predictor are detailed in this report.

2 Exploratory Data Analysis

2.1 Description of input

As an initial exploration, the summary below gives an intuitive overview of the columns in the song dataset. Each row of the dataframe is an individual song. Each song has a popularity score, its length in milliseconds, key and time signature as categorical variables, a binary variable called “audio mode” that indicates whether the song is in a major or minor scale, loudness in decibels (dB), tempo in beats per minute, and several numeric scores scaled between 0 and 1.

The scaled numeric values describe the following traits:

- acousticness: a confidence measure of whether the track is acoustic
- danceability: how suitable a song is for dancing, based on a combination of tempo, rhythm, and other metrics
- energy: a measure of intensity and activity, with high-energy tracks often louder or faster
- speechiness: a detection of spoken words on a track, higher in rap, poetry, or talk shows
- instrumentalness: a confidence measure of whether a track is instrumental and wordless
- liveness: a measure of the probability that the track was recorded in front of a live audience

Information provided by Spotify & Towards Data Science here: https://towardsdatascience.com/what-makes-a-song-likeable-dbfdb7abe404

2.2 Summary of full dataset

##   song_name         song_popularity  song_duration_ms   acousticness     
##  Length:18835       Min.   :  0.00   Min.   :  12000   Min.   :0.000001  
##  Class :character   1st Qu.: 40.00   1st Qu.: 184340   1st Qu.:0.024100  
##  Mode  :character   Median : 56.00   Median : 211306   Median :0.132000  
##                     Mean   : 52.99   Mean   : 218212   Mean   :0.258539  
##                     3rd Qu.: 69.00   3rd Qu.: 242844   3rd Qu.:0.424000  
##                     Max.   :100.00   Max.   :1799346   Max.   :0.996000  
##   danceability        energy        instrumentalness         key        
##  Min.   :0.0000   Min.   :0.00107   Min.   :0.0000000   Min.   : 0.000  
##  1st Qu.:0.5330   1st Qu.:0.51000   1st Qu.:0.0000000   1st Qu.: 2.000  
##  Median :0.6450   Median :0.67400   Median :0.0000114   Median : 5.000  
##  Mean   :0.6333   Mean   :0.64499   Mean   :0.0780080   Mean   : 5.289  
##  3rd Qu.:0.7480   3rd Qu.:0.81500   3rd Qu.:0.0025700   3rd Qu.: 8.000  
##  Max.   :0.9870   Max.   :0.99900   Max.   :0.9970000   Max.   :11.000  
##     liveness         loudness         audio_mode      speechiness    
##  Min.   :0.0109   Min.   :-38.768   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0929   1st Qu.: -9.044   1st Qu.:0.0000   1st Qu.:0.0378  
##  Median :0.1220   Median : -6.555   Median :1.0000   Median :0.0555  
##  Mean   :0.1797   Mean   : -7.447   Mean   :0.6281   Mean   :0.1021  
##  3rd Qu.:0.2210   3rd Qu.: -4.908   3rd Qu.:1.0000   3rd Qu.:0.1190  
##  Max.   :0.9860   Max.   :  1.585   Max.   :1.0000   Max.   :0.9410  
##      tempo        time_signature  audio_valence  
##  Min.   :  0.00   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 98.37   1st Qu.:4.000   1st Qu.:0.335  
##  Median :120.01   Median :4.000   Median :0.527  
##  Mean   :121.07   Mean   :3.959   Mean   :0.528  
##  3rd Qu.:139.93   3rd Qu.:4.000   3rd Qu.:0.725  
##  Max.   :242.32   Max.   :5.000   Max.   :0.984

2.3 Count of Values Across Variables

##        song_name  song_popularity song_duration_ms     acousticness 
##            13070              101            11771             3209 
##     danceability           energy instrumentalness              key 
##              849             1132             3925               12 
##         liveness         loudness       audio_mode      speechiness 
##             1425             8416                2             1224 
##            tempo   time_signature    audio_valence 
##            12112                5             1246
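
The counts above appear to be the number of distinct values in each column. A minimal sketch of how such a table could be produced, assuming the data frame is named `df`; the small data frame here is a toy stand-in, not the real dataset:

```r
# Toy stand-in for the song data frame; values are illustrative only.
df <- data.frame(
  song_name  = c("A", "B", "B", "C"),
  audio_mode = c(0, 1, 1, 0),
  key        = c(0, 5, 5, 11)
)

# Number of distinct values in each column.
unique_counts <- sapply(df, function(col) length(unique(col)))
print(unique_counts)
```

Columns with very few distinct values (such as key, audio mode, and time signature in the report) are natural candidates for categorical treatment.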

2.4 Distribution of Dependent Variable

To get a starting point, it is important to understand the distribution of song popularity ratings from this dataset, visualized here:

2.5 Correlation Matrix, Kernel Densities & Distributions

An initial look at variable correlations led to a correlation matrix plot for the full dataframe, two-dimensional kernel density plots between pairs of variables, and single-variable histograms. The correlation plot is the first figure below, followed by a sample kernel density plot and then the single-variable histograms as subplots. All other kernel density plots are in the appendix of the report. Audio mode, key, and time signature are converted to categorical variables.

Notably, a kernel density estimate (KDE) could not be produced for time signature against song popularity. The R documentation for the KDE function requires that no two quantiles of an input variable be equal. The time signature column is dominated by the value 4, to the point that the 25%, 50%, and 75% quantiles are all 4, so the function cannot process the column.
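
The quantile problem can be illustrated with a toy vector. This is a sketch under assumptions: the report does not name the KDE function, and the bandwidth formula below is the rule-of-thumb form used by `MASS::bandwidth.nrd`, written out in base R. The vector is illustrative, not the real column.

```r
# Toy vector dominated by 4s (illustrative, not the real column).
time_signature <- c(1, rep(4, 17), 5, 5)

# Five-number quantiles: 0%, 25%, 50%, 75%, 100%.
q <- quantile(time_signature)
print(q)

# The interquartile range is zero because the 25% and 75% quantiles coincide.
iqr <- IQR(time_signature)

# Assumed rule-of-thumb bandwidth: 4 * 1.06 * min(sd, IQR / 1.34) * n^(-1/5).
# A zero IQR forces the bandwidth to zero, which a KDE cannot use.
bw <- 4 * 1.06 * min(sd(time_signature), iqr / 1.34) *
  length(time_signature)^(-1 / 5)
print(bw)
```

One possible workaround, not used in the report, would be to pass an explicit positive bandwidth instead of relying on the default.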

3 Simple Regression Models

3.1 Independent Variable Relationships

From the original correlation matrix, linkages can be identified between:

  1. acousticness and energy [inverse]

  2. loudness and acousticness [inverse]

  3. loudness and energy [direct]

The linear models expected for each are explored below:
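
A minimal sketch of how these pairwise relationships could be checked, using simulated stand-in data built to have the same signs of association as described above. The numbers are not taken from the real dataset.

```r
set.seed(42)
n <- 200

# Simulated stand-in: acousticness falls and loudness rises with energy.
energy       <- runif(n)
acousticness <- pmin(pmax(1 - energy + rnorm(n, sd = 0.2), 0), 1)
loudness     <- -20 + 15 * energy + rnorm(n, sd = 2)
df <- data.frame(energy, acousticness, loudness)

# Pairwise Pearson correlations between the three predictors.
print(round(cor(df), 2))

# One of the three simple linear models between predictors.
fit <- lm(energy ~ acousticness, data = df)
print(coef(fit))
```

A negative slope in the fitted model corresponds to the "inverse" label in the list, and a positive slope to "direct".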

3.2 Independent ~ Dependent Relationships

Musical intuition suggests links between many of these variables and song popularity, and these links are tested here in the simplest way possible. A simple linear regression of song popularity on each variable in turn was fitted to gauge its impact; the energy model is shown as an example.

## 
## Call:
## lm(formula = df$song_popularity ~ df$energy, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.041 -12.955   2.962  15.992  46.988 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  52.9018     0.5067 104.409   <2e-16 ***
## df$energy     0.1397     0.7456   0.187    0.851    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.91 on 18833 degrees of freedom
## Multiple R-squared:  1.863e-06,  Adjusted R-squared:  -5.123e-05 
## F-statistic: 0.03509 on 1 and 18833 DF,  p-value: 0.8514

Many of the connections identified in the raw data fell short when an attempt was made to link them to song popularity. Each of the simple linear regression models showed a similarly weak or insignificant relationship, which pushed the next steps toward feature engineering to transform the variables into more relevant inputs.
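
The one-variable-at-a-time comparison can be sketched as a loop over predictors. The data frame here is simulated and the column selection is an assumption; only the column names follow the report.

```r
set.seed(1)
n <- 300

# Simulated stand-in; popularity is unrelated to the predictors by design.
df <- data.frame(
  song_popularity = runif(n, 0, 100),
  energy          = runif(n),
  danceability    = runif(n),
  loudness        = rnorm(n, mean = -7, sd = 4)
)
predictors <- setdiff(names(df), "song_popularity")

# Fit song_popularity ~ <predictor> for each predictor and keep R-squared.
r_squared <- sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "song_popularity"), data = df)
  summary(fit)$r.squared
})
print(round(r_squared, 4))
```

Writing the formula with bare column names and `data = df`, rather than `df$column`, also lets `predict()` work on new data frames later.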

4 Variable Transformations

4.1 Energy, Loudness, and Acousticness

The previously discussed relationship between energy, loudness, and the inverse of acousticness was combined into a multiplicative term and compared with song popularity. After this computation (detailed below), the idea was discarded: although the three sound variables are linked to one another, the strength of that link does not extend to song popularity (R-squared of about 0.027).

## $coefficients
##                   Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)      78.827376 1.23377247  63.891340 0.000000e+00
## df$energy       -22.904510 1.25245781 -18.287650 4.542973e-74
## df$acousticness  -7.018223 0.73307639  -9.573658 1.156002e-21
## df$loudness       1.241733 0.06321062  19.644375 4.587967e-85
## 
## $r.squared
## [1] 0.02730094
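
Output of this shape, with only `$coefficients` and `$r.squared`, can be obtained by subsetting a model summary. A minimal sketch on simulated data; the report does not show the exact call that was used.

```r
set.seed(2)
n <- 300

# Simulated stand-in data with the report's column names.
df <- data.frame(
  energy       = runif(n),
  acousticness = runif(n),
  loudness     = rnorm(n, mean = -7, sd = 4)
)
df$song_popularity <- 50 + rnorm(n, sd = 20)

fit <- lm(song_popularity ~ energy + acousticness + loudness, data = df)

# Keep only the coefficient table and R-squared from the model summary.
result <- summary(fit)[c("coefficients", "r.squared")]
print(result)
```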

4.2 Duration/Outliers

One of the first and most intuitive transformations converted track duration from milliseconds to minutes and removed outlier durations so that extreme values would not dominate the data.
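
A minimal sketch of this transformation. The report does not state which outlier rule was applied, so the 1.5 × IQR fence below is an assumption, and the new column name `duration_min` is hypothetical.

```r
# Toy durations in milliseconds (illustrative values).
df <- data.frame(
  song_duration_ms = c(180000, 200000, 210000, 220000, 240000, 1799346)
)

# Convert milliseconds to minutes.
df$duration_min <- df$song_duration_ms / 60000

# Assumed outlier rule: drop values more than 1.5 * IQR outside the quartiles.
q <- quantile(df$duration_min, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
keep <- df$duration_min >= q[1] - fence & df$duration_min <= q[2] + fence
df_clean <- df[keep, , drop = FALSE]
print(df_clean)
```

In this toy example only the track of roughly 30 minutes falls outside the fence and is removed.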

4.3 Duplicate Track Entries

A quick glance at the “raw” top ten most popular songs also showed a number of duplicate entries. The next cleaning step removed these duplicates from the dataset. To confirm the transformation, the adjusted top ten is displayed below.

##  [1] "Happier"                                       
##  [2] "I Love It (& Lil Pump)"                        
##  [3] "Taki Taki (with Selena Gomez, Ozuna & Cardi B)"
##  [4] "Eastside (with Halsey & Khalid)"               
##  [5] "Promises (with Sam Smith)"                     
##  [6] "In My Feelings"                                
##  [7] "Falling Down"                                  
##  [8] "SICKO MODE"                                    
##  [9] "In My Mind"                                    
## [10] "Lucid Dreams"
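
The duplicate removal and top-ten listing can be sketched as follows. The report does not say which columns define a duplicate; treating a repeated `song_name` as a duplicate is an assumption, and the track names are placeholders.

```r
# Toy stand-in with repeated entries (placeholder names).
df <- data.frame(
  song_name       = c("Track A", "Track A", "Track B", "Track C", "Track C"),
  song_popularity = c(98, 98, 95, 90, 90),
  stringsAsFactors = FALSE
)

# Assumed definition of a duplicate: a repeated song_name (first copy kept).
df_unique <- df[!duplicated(df$song_name), ]

# Sort by popularity, descending, and list up to ten names.
top <- head(df_unique[order(-df_unique$song_popularity), "song_name"], 10)
print(top)
```

Deduplicating on the name alone would also merge different songs that share a title, so a stricter key (for example name plus duration) may be preferable on the real data.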

The top ten list prompted further exploration: a correlation matrix for the top 100 songs is reproduced in the appendix.

4.4 Quintile Development

The next idea was to split the dataset into quintiles of song popularity and to predict membership in each quintile with a classification model built on the variables that distinguish the groups. The steps are outlined as:

  1. Dividing the dataframe from above into quintiles of the song_popularity variable

  2. Finding the means across quintile booleans

  3. Finding the standard deviations across quintiles

  4. Finding quintile regression models (removed all but 1 to use as example)

  5. Training and testing of above model(s)
## 
##   0   1 
## 0.8 0.2
## 
## Call:
## glm(formula = pop_5 ~ energy + loudness + instrumentalness + 
##     acousticness, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2821  -0.7357  -0.6008  -0.1986   3.9714  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       2.23217    0.19470   11.46   <2e-16 ***
## energy           -2.92016    0.19020  -15.35   <2e-16 ***
## loudness          0.18957    0.01204   15.74   <2e-16 ***
## instrumentalness -3.06269    0.28002  -10.94   <2e-16 ***
## acousticness     -1.24168    0.11223  -11.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13196  on 13184  degrees of freedom
## Residual deviance: 12340  on 13180  degrees of freedom
## AIC: 12350
## 
## Number of Fisher Scoring iterations: 6
##    
##     FALSE
##   0 10548
##   1  2637

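Quintile Modeling Conclusion: This approach was unsuccessful, as no combination of variables was able to predict the quintile of song popularity. In the example above, the classification table has only a FALSE column, meaning none of the 13,185 observations was assigned to the quintile.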

5 Final Regression Model

5.1 Final Model Selection

After many rounds of testing, energy was kept as the driving variable of the linear regression model. No feature engineering beyond duplicate removal appeared to improve the model's predictive power. The final regression uses energy, key, and speechiness to predict song popularity.

The model summary and plot for the selected final regression model:

## 
## Call:
## lm(formula = song_popularity ~ energy + key + speechiness, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.868 -12.691   2.677  15.882  47.004 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  52.8905     0.6723  78.669  < 2e-16 ***
## energy       -0.2700     0.7469  -0.362  0.71771    
## key1          2.9352     0.6669   4.401 1.08e-05 ***
## key2         -1.1305     0.7054  -1.603  0.10903    
## key3         -2.5444     1.0735  -2.370  0.01779 *  
## key4         -1.2334     0.7608  -1.621  0.10499    
## key5          0.1841     0.7226   0.255  0.79890    
## key6          2.0780     0.7577   2.743  0.00610 ** 
## key7         -1.9612     0.6738  -2.911  0.00361 ** 
## key8         -0.4631     0.7585  -0.611  0.54150    
## key9         -1.7187     0.7074  -2.430  0.01513 *  
## key10         0.4763     0.7618   0.625  0.53186    
## key11         1.0894     0.7214   1.510  0.13106    
## speechiness   3.0004     1.5438   1.943  0.05198 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.85 on 18821 degrees of freedom
## Multiple R-squared:  0.005641,   Adjusted R-squared:  0.004954 
## F-statistic: 8.212 on 13 and 18821 DF,  p-value: < 2.2e-16

5.2 Coefficient Parameter Interpretation

Based on the original correlation figure, both the energy and speechiness coefficients were expected to be positive. The model above matches that expectation for speechiness but not for energy, although the small magnitude of the energy coefficient is consistent with the weak relationships shown at the beginning of the report. Each key coefficient reflects the positive or negative impact of that key on the popularity score relative to key 0. For example, when energy and speechiness are zero, a track in key 0 has a baseline score of 52.89. A track in key 1 raises this score by about 2.94, while a track in key 9 lowers it by about 1.72. The key coefficients act as binary indicators: each applies only when the track falls in that key.
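
The worked example can be reproduced directly from the coefficients in the model summary. The helper name `predict_popularity` is hypothetical and only illustrates the arithmetic.

```r
# Coefficients copied from the final model summary.
intercept     <- 52.8905
b_energy      <- -0.2700
b_speechiness <- 3.0004
b_key         <- c(key1 = 2.9352, key9 = -1.7187)

# Hypothetical helper: key_offset is 0 for key 0, else that key's coefficient.
predict_popularity <- function(energy, speechiness, key_offset = 0) {
  intercept + b_energy * energy + b_speechiness * speechiness + key_offset
}

base_score <- predict_popularity(0, 0)                   # key 0 baseline
key1_score <- predict_popularity(0, 0, b_key[["key1"]])  # track in key 1
key9_score <- predict_popularity(0, 0, b_key[["key9"]])  # track in key 9
print(c(base = base_score, key1 = key1_score, key9 = key9_score))
```

With energy and speechiness at zero, the three scores are 52.8905, 55.8257, and 51.1718.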

5.3 Prediction Modeling - Train/Test Split

The model was applied to a 70/30 train-test split of the transformed data set to test predictive success. The root-mean-square errors are printed below for both the training dataset and the testing dataset.

## 
## The root mean square error of the training data is: 21.84841
## 
## The root mean square error of the testing data is: 21.84161

The RMSE measures how spread out the residuals are, that is, how closely the observed data concentrate around the model's predictions. An ideal model has a low RMSE and similar values on the training and testing data, since that indicates it predicts unseen data as well as the data it was fitted on. Of the models tried, this final model minimized the RMSE, and its training and testing values are nearly identical (21.85 and 21.84).
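
A minimal sketch of the 70/30 split and RMSE calculation, on simulated data with the same model formula. The seed, the sampling method, and the helper name `rmse` are assumptions; the report does not show its code.

```r
set.seed(4)
n <- 1000

# Simulated stand-in data with the report's column names.
df <- data.frame(
  energy      = runif(n),
  speechiness = runif(n),
  key         = factor(sample(0:11, n, replace = TRUE))
)
df$song_popularity <- 50 + 3 * df$speechiness + rnorm(n, sd = 20)

# 70/30 train-test split by random row index.
train_idx <- sample(n, size = round(0.7 * n))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

fit <- lm(song_popularity ~ energy + key + speechiness, data = train)

# Root-mean-square error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse_train <- rmse(train$song_popularity, predict(fit, newdata = train))
rmse_test  <- rmse(test$song_popularity, predict(fit, newdata = test))
cat("Training RMSE:", rmse_train, "\nTesting RMSE:", rmse_test, "\n")
```

Comparing the RMSE with the standard deviation of the response is a useful sanity check: if the two are close, the model explains little beyond predicting the mean.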

5.4 Sensitivity analysis of final model

For the sensitivity analysis, each level of the categorical key variable was expanded into its own indicator variable so that an individual standard deviation could be computed for it. For each variable, the absolute value of the model coefficient is multiplied by the standard deviation of that variable in the data used to fit the model, and these products are summed.

Each variable's coefficient multiplied by its standard deviation is then standardized by dividing by that sum.

The sensitivity plot shows the percentage of the model driven by each variable individually; the sign of each value indicates whether the variable is positively or negatively related to the predicted score.
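
One reading of this procedure is sketched below on simulated data: each sensitivity is the coefficient times the variable's standard deviation, divided by the sum of the absolute values of those products. This interpretation is an assumption, since the report does not show the calculation.

```r
set.seed(5)
n <- 500

# Simulated stand-in data with the report's column names.
df <- data.frame(
  energy      = runif(n),
  speechiness = runif(n),
  key         = factor(sample(0:11, n, replace = TRUE))
)
df$song_popularity <- 50 + 3 * df$speechiness + rnorm(n, sd = 20)

fit <- lm(song_popularity ~ energy + key + speechiness, data = df)

# Design matrix without the intercept: key is expanded into 0/1 indicators.
X    <- model.matrix(fit)[, -1]
beta <- coef(fit)[-1]

# Coefficient times standard deviation, per variable.
contrib <- beta * apply(X, 2, sd)

# Standardize so the absolute sensitivities sum to 1 (signs are kept).
sensitivity <- contrib / sum(abs(contrib))
print(round(sensitivity, 3))
```

Multiplying by 100 gives the percentages shown on a sensitivity plot, with the sign indicating the direction of each variable's effect.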

5.5 Actual vs Predicted Fit

Below is the generated figure of predicted values in blue plotted over the observed values in the data set.

This model should be considered “safe”: the data is complex and full of parameters that are unrelated to song popularity. The final regression gives a “middle of the pack” prediction of the score based on key, energy, and speechiness, since these independent variables produced the best RMSE and fit of the models tried. Song popularity otherwise appears randomly distributed, with little to no relationship that a model could exploit.

6 Appendix

6.1 Additional Exploratory Visualizations

6.1.1 KDE Plots

## 
## Time Signature has quantiles of: 1 4 4 4 5 which does not permit it to be used in a 
## kde function and becomes a fairly useless variable

6.1.2 Categorical variable exploration

6.1.3 top 100 correlation exploration