In this project, analysis and modeling strategies are applied to a dataset of song information to develop a regression model that predicts song popularity from impactful variables. This report details the steps used to find the best predictor.
As an initial exploration, the summary below is used to build intuition about the columns in the song dataset. The dataframe includes individual observations of many different songs. Each song has a popularity ranking, its length in milliseconds, key and time signature as categorical variables, a binary variable called “audio mode” that indicates whether the song is in a major or minor scale, loudness in decibels (dB), tempo in beats per minute, and several numeric values scaled between 0 and 1.
The scaled numeric values describe the following traits:

- acousticness: a confidence measure of whether the track is acoustic
- danceability: how suitable a song is for dancing, based on a combination of tempo, rhythms, and other metrics
- energy: a measure of intensity and activity, with high-energy tracks often louder or faster
- speechiness: a detection of spoken words on a track, higher in pieces of rap, poetry, or talk shows
- instrumentalness: a confidence metric of whether a track is instrumental and wordless
- liveness: a measure of the probability that the track was recorded in front of a live audience
Information provided by Spotify & Towards Data Science here: https://towardsdatascience.com/what-makes-a-song-likeable-dbfdb7abe404
## song_name song_popularity song_duration_ms acousticness
## Length:18835 Min. : 0.00 Min. : 12000 Min. :0.000001
## Class :character 1st Qu.: 40.00 1st Qu.: 184340 1st Qu.:0.024100
## Mode :character Median : 56.00 Median : 211306 Median :0.132000
## Mean : 52.99 Mean : 218212 Mean :0.258539
## 3rd Qu.: 69.00 3rd Qu.: 242844 3rd Qu.:0.424000
## Max. :100.00 Max. :1799346 Max. :0.996000
## danceability energy instrumentalness key
## Min. :0.0000 Min. :0.00107 Min. :0.0000000 Min. : 0.000
## 1st Qu.:0.5330 1st Qu.:0.51000 1st Qu.:0.0000000 1st Qu.: 2.000
## Median :0.6450 Median :0.67400 Median :0.0000114 Median : 5.000
## Mean :0.6333 Mean :0.64499 Mean :0.0780080 Mean : 5.289
## 3rd Qu.:0.7480 3rd Qu.:0.81500 3rd Qu.:0.0025700 3rd Qu.: 8.000
## Max. :0.9870 Max. :0.99900 Max. :0.9970000 Max. :11.000
## liveness loudness audio_mode speechiness
## Min. :0.0109 Min. :-38.768 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0929 1st Qu.: -9.044 1st Qu.:0.0000 1st Qu.:0.0378
## Median :0.1220 Median : -6.555 Median :1.0000 Median :0.0555
## Mean :0.1797 Mean : -7.447 Mean :0.6281 Mean :0.1021
## 3rd Qu.:0.2210 3rd Qu.: -4.908 3rd Qu.:1.0000 3rd Qu.:0.1190
## Max. :0.9860 Max. : 1.585 Max. :1.0000 Max. :0.9410
## tempo time_signature audio_valence
## Min. : 0.00 Min. :0.000 Min. :0.000
## 1st Qu.: 98.37 1st Qu.:4.000 1st Qu.:0.335
## Median :120.01 Median :4.000 Median :0.527
## Mean :121.07 Mean :3.959 Mean :0.528
## 3rd Qu.:139.93 3rd Qu.:4.000 3rd Qu.:0.725
## Max. :242.32 Max. :5.000 Max. :0.984
## song_name song_popularity song_duration_ms acousticness
## 13070 101 11771 3209
## danceability energy instrumentalness key
## 849 1132 3925 12
## liveness loudness audio_mode speechiness
## 1425 8416 2 1224
## tempo time_signature audio_valence
## 12112 5 1246
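The two outputs above have the shape of a per-column summary followed by a count of distinct values per column. A minimal sketch of how output like this could be produced is given here; the data frame `songs` and its values are invented stand-ins, not the real dataset.

```r
# Toy stand-in for the song data; all values are invented for illustration.
songs <- data.frame(
  song_name = c("a", "b", "b", "c"),
  song_popularity = c(10, 55, 55, 90),
  audio_mode = c(0, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Per-column summary: quartiles for numeric columns, length/class for text.
print(summary(songs))

# Number of distinct values in each column.
n_unique <- sapply(songs, function(x) length(unique(x)))
print(n_unique)
```

A column with very few distinct values, such as `audio_mode` here, is a natural candidate for treatment as a categorical variable.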
To get a starting point, it is important to understand the distribution of song popularity ratings from this dataset, visualized here:
Initial exploration of variable correlations led to the development of a correlation matrix plot for the full dataframe, along with two-dimensional kernel density plots between pairs of variables and single-variable histograms. The correlation plot is shown as the first figure below, a sample kernel density plot follows, and the single-variable histograms are shown last as subplots. All other kernel density plots are viewable in the appendix of the report. Audio mode, key, and time signature are converted to categorical variables.
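As a rough sketch of this step, the snippet below computes a correlation matrix on numeric columns and then converts the three categorical columns to factors. The data are simulated and the object names (`songs`, `cat_cols`) are assumptions, so the printed correlations carry no meaning for the real dataset.

```r
set.seed(1)
n <- 200

# Simulated stand-in data; column names follow the report, values do not.
songs <- data.frame(
  energy = runif(n),
  loudness = rnorm(n, mean = -7, sd = 3),
  key = sample(0:11, n, replace = TRUE),
  audio_mode = sample(0:1, n, replace = TRUE),
  time_signature = sample(c(3, 4, 5), n, replace = TRUE)
)

# Pairwise Pearson correlations across all numeric columns.
corr <- cor(songs[, sapply(songs, is.numeric)])
print(round(corr, 2))

# Treat key, audio mode, and time signature as categorical variables.
cat_cols <- c("key", "audio_mode", "time_signature")
songs[cat_cols] <- lapply(songs[cat_cols], factor)
str(songs)
```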
Notably, a kernel density estimate (KDE) could not be produced for time signature against song popularity. The R documentation for the function used requires that no two quantiles of an input variable be equal. The time signature column is dominated by the value 4, to the point that the 25%, 50%, and 75% quantiles are all 4, so the KDE function cannot process this variable.
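The quantile problem can be illustrated with base R alone. The vector below is invented to mimic a column dominated by the value 4; it is not the real data, and the sketch does not call the KDE function used in the report.

```r
# Invented time signatures dominated by 4 (100 values in total).
time_signature <- c(1, 3, rep(4, 95), 5, 5, 5)

# Minimum, quartiles, and maximum.
q <- quantile(time_signature, probs = c(0, 0.25, 0.5, 0.75, 1))
print(q)

# The 25% and 75% quantiles coincide, so the interquartile range is zero.
iqr <- IQR(time_signature)
print(iqr)
```

A bandwidth rule that scales with the interquartile range can collapse to zero in this situation, which is one plausible reason a density estimator would refuse such an input.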
From the original correlation matrix, linkages can be identified between:

1. acousticness and energy [inverse]
2. loudness and acousticness [inverse]
3. loudness and energy [direct]

The linear models expected for each are explored below:
Intuitive knowledge of music could also be used to link many of these variables to song popularity, so these connections are tested on a simple basis below. As a first pass, a simple linear regression model was fitted between each variable and song popularity to view its impact.
##
## Call:
## lm(formula = df$song_popularity ~ df$energy, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.041 -12.955 2.962 15.992 46.988
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.9018 0.5067 104.409 <2e-16 ***
## df$energy 0.1397 0.7456 0.187 0.851
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.91 on 18833 degrees of freedom
## Multiple R-squared: 1.863e-06, Adjusted R-squared: -5.123e-05
## F-statistic: 0.03509 on 1 and 18833 DF, p-value: 0.8514
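One way to repeat this single-predictor fit for every candidate variable is to loop over the column names and collect each model's R-squared. The sketch below uses simulated data and assumed object names (`songs`, `predictors`), so its numbers say nothing about the real dataset.

```r
set.seed(42)
n <- 500

# Simulated stand-in data with no built-in relationship to popularity.
songs <- data.frame(
  song_popularity = runif(n, 0, 100),
  energy = runif(n),
  danceability = runif(n),
  loudness = rnorm(n, mean = -7, sd = 3)
)

predictors <- setdiff(names(songs), "song_popularity")

# Fit song_popularity ~ <predictor> once per column and keep R-squared.
r2 <- sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "song_popularity"), data = songs)
  summary(fit)$r.squared
})
print(round(r2, 4))
```

Writing the formula with bare column names and passing `data = songs` keeps the coefficient labels short (`energy` rather than `df$energy`) and lets `predict()` work on new data frames.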
Many of the connections identified in the raw data fell short when an attempt was made to link them to song popularity. Each of the simple linear regression models indicated a similarly negligible relationship, which pushed the next steps toward feature engineering to transform the variables into more relevant features.
The previously discussed relationship between energy, loudness, and the inverse of acousticness was transformed into a multiplicative property to compare with song popularity. Following this computation (detailed below), the idea was dropped: although a linkage exists among the three sound variables, the strength of that connection did not extend to song popularity, with an R-squared of only about 0.027.
## $coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.827376 1.23377247 63.891340 0.000000e+00
## df$energy -22.904510 1.25245781 -18.287650 4.542973e-74
## df$acousticness -7.018223 0.73307639 -9.573658 1.156002e-21
## df$loudness 1.241733 0.06321062 19.644375 4.587967e-85
##
## $r.squared
## [1] 0.02730094
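Output of this shape, a list with `$coefficients` and `$r.squared`, can be obtained by subsetting a model summary. The sketch below shows that pattern on simulated data; the data frame and its values are assumptions, and only the three predictor names are taken from the output above.

```r
set.seed(7)
n <- 500

# Simulated stand-in data; popularity is unrelated to the predictors here.
songs <- data.frame(
  energy = runif(n),
  acousticness = runif(n),
  loudness = rnorm(n, mean = -7, sd = 3)
)
songs$song_popularity <- 50 + rnorm(n, sd = 20)

fit <- lm(song_popularity ~ energy + acousticness + loudness, data = songs)

# Keep only the coefficient table and R-squared from the full summary.
out <- summary(fit)[c("coefficients", "r.squared")]
print(out)
```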
One of the first and most intuitive transformations converted the track duration from milliseconds to minutes and removed outlier durations, so that extreme values would not dominate the data.
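A minimal sketch of this transformation follows. The report does not state which outlier rule was used, so the 1.5 × IQR fence here is an assumption, as are the data frame `songs` and the new column name `duration_min`.

```r
# Invented durations in milliseconds, including one very short and one
# very long track.
songs <- data.frame(
  song_duration_ms = c(12000, 180000, 200000, 210000, 215000,
                       230000, 240000, 260000, 1799346)
)

# Convert milliseconds to minutes.
songs$duration_min <- songs$song_duration_ms / 60000

# Assumed outlier rule: drop values beyond 1.5 * IQR from the quartiles.
q <- unname(quantile(songs$duration_min, probs = c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
keep <- songs$duration_min >= q[1] - fence &
  songs$duration_min <= q[2] + fence

trimmed <- songs[keep, ]
print(trimmed)
```

Other rules, such as fixed minimum and maximum durations, would work the same way by changing the `keep` condition.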
A quick glance at the “raw” top ten most popular songs also showed a number of duplicate entries. The next cleansing transformation was to rid the dataset of these duplicates. To showcase the completion of the transformation, the adjusted top ten is displayed below.
## [1] "Happier"
## [2] "I Love It (& Lil Pump)"
## [3] "Taki Taki (with Selena Gomez, Ozuna & Cardi B)"
## [4] "Eastside (with Halsey & Khalid)"
## [5] "Promises (with Sam Smith)"
## [6] "In My Feelings"
## [7] "Falling Down"
## [8] "SICKO MODE"
## [9] "In My Mind"
## [10] "Lucid Dreams"
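Duplicate removal followed by a ranked listing can be sketched as below. The track names and popularity scores are invented, and matching duplicates on `song_name` alone is an assumption; the report does not say which columns were compared.

```r
# Invented tracks with repeated entries.
songs <- data.frame(
  song_name = c("Track A", "Track A", "Track B",
                "Track C", "Track C", "Track D"),
  song_popularity = c(100, 100, 97, 95, 95, 90),
  stringsAsFactors = FALSE
)

# Keep the first occurrence of each song name.
deduped <- songs[!duplicated(songs$song_name), ]

# Sort by popularity, highest first, and list up to ten names.
top <- deduped[order(-deduped$song_popularity), ]
print(head(top$song_name, 10))
```

Matching on name alone would also merge different songs that share a title, so a stricter key (for example name plus duration) may be preferable on real data.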
The top ten list piqued curiosity for further exploration, with a correlation matrix of values reproduced for the top 100 in the appendix to satisfy this curiosity.
The next idea was to split the dataset into quantiles of song popularity, dividing the data into groups by score and making predictions within each group using a classification model built on distinguishing variables. The steps are outlined as:
First, dividing the dataframe from above into quintiles of the song_popularity variable.
Then, finding the means across the quintile booleans.
##
## 0 1
## 0.8 0.2
##
## Call:
## glm(formula = pop_5 ~ energy + loudness + instrumentalness +
## acousticness, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2821 -0.7357 -0.6008 -0.1986 3.9714
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.23217 0.19470 11.46 <2e-16 ***
## energy -2.92016 0.19020 -15.35 <2e-16 ***
## loudness 0.18957 0.01204 15.74 <2e-16 ***
## instrumentalness -3.06269 0.28002 -10.94 <2e-16 ***
## acousticness -1.24168 0.11223 -11.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13196 on 13184 degrees of freedom
## Residual deviance: 12340 on 13180 degrees of freedom
## AIC: 12350
##
## Number of Fisher Scoring iterations: 6
##
## FALSE
## 0 10548
## 1 2637
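The quintile workflow above can be sketched end to end on simulated data. The formula and the name `pop_5` follow the output shown; the simulated values, the 70% training share, and the 0.5 classification threshold are assumptions.

```r
set.seed(123)
n <- 1000

# Simulated stand-in data with no real signal.
songs <- data.frame(
  song_popularity = runif(n, 0, 100),
  energy = runif(n),
  loudness = rnorm(n, mean = -7, sd = 3),
  instrumentalness = runif(n),
  acousticness = runif(n)
)

# Assign each song to a popularity quintile (1 = lowest, 5 = highest).
breaks <- quantile(songs$song_popularity, probs = seq(0, 1, by = 0.2))
songs$quintile <- cut(songs$song_popularity, breaks = breaks,
                      include.lowest = TRUE, labels = FALSE)

# Boolean indicator for the fifth quintile and its class balance.
songs$pop_5 <- as.integer(songs$quintile == 5)
print(prop.table(table(songs$pop_5)))

# Fit a logistic regression on a 70% training sample.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- songs[train_idx, ]
fit <- glm(pop_5 ~ energy + loudness + instrumentalness + acousticness,
           family = "binomial", data = train)

# Classify at an assumed 0.5 threshold and cross-tabulate with the truth.
prob <- predict(fit, type = "response")
print(table(train$pop_5, prob > 0.5))
```

When the predictors carry little signal, the fitted probabilities stay near the base rate of 0.2 and few or no cases cross the 0.5 threshold, which produces a table with a single FALSE column.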
Quantile Modeling Conclusion: This stage of modeling was unsuccessful, as no combination of variables was able to predict the quintile of song popularity. In the table above, all 13,185 training observations fall in the FALSE column, so the model never predicts membership in the quintile.
After many rounds of testing, energy was settled on as the driving variable behind the linear regression modeling. No attempt at feature engineering beyond duplicate removal appeared to improve the predictive power of the model. The final regression was selected to include energy, key, and speechiness to predict song popularity.
The model summary and plot for the selected final regression model:
##
## Call:
## lm(formula = song_popularity ~ energy + key + speechiness, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.868 -12.691 2.677 15.882 47.004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.8905 0.6723 78.669 < 2e-16 ***
## energy -0.2700 0.7469 -0.362 0.71771
## key1 2.9352 0.6669 4.401 1.08e-05 ***
## key2 -1.1305 0.7054 -1.603 0.10903
## key3 -2.5444 1.0735 -2.370 0.01779 *
## key4 -1.2334 0.7608 -1.621 0.10499
## key5 0.1841 0.7226 0.255 0.79890
## key6 2.0780 0.7577 2.743 0.00610 **
## key7 -1.9612 0.6738 -2.911 0.00361 **
## key8 -0.4631 0.7585 -0.611 0.54150
## key9 -1.7187 0.7074 -2.430 0.01513 *
## key10 0.4763 0.7618 0.625 0.53186
## key11 1.0894 0.7214 1.510 0.13106
## speechiness 3.0004 1.5438 1.943 0.05198 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.85 on 18821 degrees of freedom
## Multiple R-squared: 0.005641, Adjusted R-squared: 0.004954
## F-statistic: 8.212 on 13 and 18821 DF, p-value: < 2.2e-16
From the original correlation figure, both energy and speechiness were expected to have positive coefficients. The above model matches that expectation for speechiness but not for energy; however, the energy coefficient is small and not statistically significant (p = 0.72), which matches the weak relationships showcased at the beginning of the report. Each key coefficient gives the positive or negative shift in the popularity score for that key relative to key 0, the baseline. For example, if energy and speechiness are zero, a track in key 0 has a base score of 52.89. A track in key 1 raises this score by about 2.94, while a track in key 9 lowers it by about 1.72. Each key coefficient acts as an indicator: it applies only when the track falls in that key.
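The arithmetic behind these examples can be written out directly from the coefficient table. The helper `predict_pop` below is a hypothetical name introduced for illustration, and it covers only keys 0, 1, and 9.

```r
# Coefficients copied from the model summary above.
coefs <- c(intercept = 52.8905, energy = -0.2700,
           key1 = 2.9352, key9 = -1.7187, speechiness = 3.0004)

# Hypothetical helper: predicted popularity for a track in key 0, 1, or 9.
predict_pop <- function(energy, speechiness, key) {
  key_term <- switch(as.character(key),
    "0" = 0,                   # baseline key, no adjustment
    "1" = coefs[["key1"]],
    "9" = coefs[["key9"]],
    stop("key not included in this sketch")
  )
  coefs[["intercept"]] + coefs[["energy"]] * energy +
    coefs[["speechiness"]] * speechiness + key_term
}

print(predict_pop(0, 0, key = 0))  # 52.8905
print(predict_pop(0, 0, key = 1))  # 52.8905 + 2.9352 = 55.8257
print(predict_pop(0, 0, key = 9))  # 52.8905 - 1.7187 = 51.1718
```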
The model was applied to a 70/30 train-test split of the transformed data set to test predictive success. The root-mean-square errors are printed below for both the training dataset and the testing dataset.
##
## The root mean square error of the training data is: 21.84841
##
## The root mean square error of the testing data is: 21.84161
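A 70/30 split with an RMSE on each part can be sketched as follows. The data are simulated, the helper `rmse` and the other object names are assumptions, and the printed values will not match the figures above.

```r
set.seed(2024)
n <- 1000

# Simulated stand-in data; key is a factor so every level exists in both
# the training and the testing subset.
songs <- data.frame(
  energy = runif(n),
  speechiness = runif(n),
  key = factor(sample(0:11, n, replace = TRUE))
)
songs$song_popularity <- 53 + rnorm(n, sd = 20)

# 70% of rows for training, the remaining 30% for testing.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- songs[train_idx, ]
test <- songs[-train_idx, ]

fit <- lm(song_popularity ~ energy + key + speechiness, data = train)

# Root-mean-square error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_train <- rmse(train$song_popularity, predict(fit, newdata = train))
rmse_test <- rmse(test$song_popularity, predict(fit, newdata = test))

cat("\nThe root mean square error of the training data is:", rmse_train, "\n")
cat("\nThe root mean square error of the testing data is:", rmse_test, "\n")
```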
The RMSE measures how spread out the residuals are, that is, how tightly the observed data concentrate around the model's predictions. The ideal model minimizes the RMSE and also yields similar values on the training and testing data, since it should predict equally well on both. This final prediction model gave the lowest RMSE of the models tested and kept the two values nearly identical (21.85 for training and 21.84 for testing).
The sensitivity analysis split the categorical key variable into a separate indicator variable for each key so that an individual standard deviation could be computed for each. For each variable, the absolute value of its model coefficient is multiplied by the standard deviation of that variable in the data used to fit the model, and these products are added together to give the sum of the standard deviations.
Each variable's standard deviation is then multiplied by its model coefficient and standardized by dividing by that sum.
The sensitivity plot shows the percentage of the model driven by each variable individually, and the positive or negative direction of each bar represents that parameter's correlation with the predicted score.
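One reading of this calculation is that each variable's signed share equals its coefficient times its standard deviation, divided by the sum of the absolute values of those products. The sketch below implements that reading on simulated data with only four keys; both the interpretation and the object names are assumptions.

```r
set.seed(99)
n <- 600

# Simulated stand-in data with a small built-in speechiness effect.
songs <- data.frame(
  energy = runif(n),
  speechiness = runif(n),
  key = factor(sample(0:3, n, replace = TRUE))
)
songs$song_popularity <- 53 + 3 * songs$speechiness + rnorm(n, sd = 20)

fit <- lm(song_popularity ~ energy + key + speechiness, data = songs)

# Design matrix without the intercept: one indicator column per
# non-baseline key.
X <- model.matrix(fit)[, -1, drop = FALSE]
beta <- coef(fit)[colnames(X)]

# |coefficient| * standard deviation of each column, then signed shares.
contrib <- abs(beta) * apply(X, 2, sd)
sensitivity <- sign(beta) * contrib / sum(contrib)

print(round(100 * sensitivity, 1))
```

The absolute shares sum to 100%, so each value can be read as the percentage of the model's variation attributed to that variable, with the sign giving the direction of its effect.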
Below is the generated figure of predicted values in blue plotted over the observed values in the data set.
This model should be considered “safe”, as this data is quite complex and full of unrelated parameters when trying to develop a model to predict song popularity. The final regression submitted presents a “middle of the pack” prediction of score based on key, energy, and speechiness, given that these independent variables generated the best RMSE and fit of the models tested. Song popularity is otherwise randomly distributed, with little to no correlation that a model can exploit.
##
## Time Signature has quantiles of: 1 4 4 4 5 which does not permit it to be used in a
## kde function and becomes a fairly useless variable