PREDICTING SONG HOTNESS
Presentation transcript:

1 PREDICTING SONG HOTNESS
Million Songs: Predicting Song Hotness
Michael Ball, Nishok Chetty, Rohan Roy Choudhury, Alper Vural

2 PROBLEM STATEMENT
- The music industry makes a lot of money from popular music and is highly invested in identifying trending features.
- It is especially interested in an algorithmic way to evaluate the potential popularity of a new song.
- Can we predict whether a song is going to be popular?
- Can we determine what factors make a song popular?
(Presenter: Chetty)

3 METHODS
- Using machine learning, predict whether a song is going to be popular.
- Use feature-importance metrics to explore what makes certain songs popular.
- Quality metrics: classification accuracy, ROC/AUC.
(Presenter: Chetty)

4 DATA EXPLORATION
- Dataset name: Million Song Dataset
- Dataset size: 1 million song records, stored as compressed HDF5 files (see the sketch below)
- Features include: key, duration, energy, tempo, artist details, and more (50+ features)
- Class label: song hotness (a popularity metric)
(Presenter: Chetty)
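A minimal sketch of reading one song record with h5py. The group and field names follow the published Million Song Dataset HDF5 schema; the slides do not list them, so treat them (and the file name) as assumptions.

import h5py

# Hypothetical MSD track file; each song is one HDF5 file.
with h5py.File("TRAXLZU12903D05F94.h5", "r") as f:
    meta = f["metadata"]["songs"][0]      # compound row of metadata fields
    analysis = f["analysis"]["songs"][0]  # compound row of audio-analysis fields
    print("title:             ", meta["title"])
    print("artist_familiarity:", meta["artist_familiarity"])
    print("song_hotttnesss:   ", meta["song_hotttnesss"])  # the MSD's triple-t spelling
    print("tempo:             ", analysis["tempo"])
    print("duration:          ", analysis["duration"])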

5 DATA PREPARATION
Data cleaning/imputation (see the sketch below):
- Dropped records with missing hotness data.
- Dropped records with missing year.
- Imputed longitude, latitude, and location.
- Checked for duplicate keys (song_id is our unique record identifier).
- Checked for statistical anomalies using the basic statistics described previously; the only anomalies were the energy and danceability columns, which we dropped.
(Presenter: Michael)
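A pandas sketch of these cleaning steps, assuming the HDF5 files have already been flattened into one CSV; the column names (song_hotttnesss, year, artist_longitude, etc.) are assumptions based on the MSD schema.

import pandas as pd

df = pd.read_csv("million_songs.csv")  # hypothetical flattened extract

# Drop records with a missing class label (hotness) or a missing year.
df = df.dropna(subset=["song_hotttnesss", "year"])
df = df[df["year"] > 0]  # the MSD encodes an unknown year as 0

# Impute missing longitude/latitude with column medians.
for col in ["artist_longitude", "artist_latitude"]:
    df[col] = df[col].fillna(df[col].median())

# song_id is the unique record identifier; verify there are no duplicates.
assert not df["song_id"].duplicated().any()

# Energy and danceability were the only statistical anomalies, so drop them.
df = df.drop(columns=["energy", "danceability"])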

6 VISUALIZATION
[Scatter plots: Hotness vs. Artist Familiarity, Hotness vs. Loudness, Hotness vs. Year]
- Songs in the dataset are highly similar, as we would expect of pop music.
- Many unrated songs, with a bias towards more recent music.
(Presenter: Michael)
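A matplotlib sketch of the three scatter plots on this slide, assuming the cleaned DataFrame df from the preparation step.

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ["artist_familiarity", "loudness", "year"]):
    ax.scatter(df[col], df["song_hotttnesss"], s=2, alpha=0.2)
    ax.set_xlabel(col)
    ax.set_ylabel("hotness")
plt.tight_layout()
plt.show()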

7 FEATURE ENGINEERING
- TF-IDF on song_title.
- Create a decade feature: we know that music patterns can be described by decades, so we binned years into decades.
- Genre: bag of words on artist_terms. The MSD (surprisingly!) does not have a column for the genre of a song, so we categorized songs into appropriate genres based on the content of artist_terms.
- Ablation to determine the optimal feature set (the transforms are sketched below).
(Presenter: Michael)
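A scikit-learn sketch of the three transforms, assuming the title and artist_terms columns hold plain strings (artist_terms as a space-separated tag list); the vocabulary caps are illustrative, not values from the slides.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Decade feature: bin the release year into decades (1987 -> 1980).
df["decade"] = (df["year"] // 10) * 10

# TF-IDF on song titles; max_features caps the vocabulary size.
tfidf = TfidfVectorizer(max_features=500)
title_features = tfidf.fit_transform(df["title"].fillna(""))

# Bag of words on artist_terms as a genre proxy, since the MSD
# has no genre column.
bow = CountVectorizer(max_features=200, binary=True)
genre_features = bow.fit_transform(df["artist_terms"].fillna(""))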

8 MODELS
Tuned using 5-fold cross-validation with grid search (sketched below):
- SVM (baseline): kernel: RBF, C: 256
- Random Forest: max depth: 40, min samples for split: 10, num trees: 10
- Logistic Regression: C: 512
- Decision Tree: depth: 5, min samples for split: 10
- AdaBoost: num trees: 200, learning rate: 0.01
- K-Nearest Neighbors: k = 1
- Neural Network (Multi-Layer Perceptron): algorithm: L-BFGS, learning rate:
(Presenter: Alper)
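A sketch of the 5-fold grid search for the Random Forest; the candidate grids are illustrative, since the slide reports only the winning values (max depth 40, min samples for split 10, 10 trees). X_train and y_train are the assumed engineered features and hotness labels.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [10, 20, 40, None],
    "min_samples_split": [2, 10, 20],
    "n_estimators": [10, 50, 100],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)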

9 RESULTS: ACCURACY

Model                       Accuracy (%)
Baseline (frequency-based)  55.2
Baseline (SVM)              56.3
Neural Network (MLP)        67.7
kNN                         71.2
Logistic Regression         73.4
Decision Tree               72
Random Forest               77.8
AdaBoost                    74.6

- Significant improvement over the baseline (a simple SVM, at 56.3%).
- Best model: Random Forest, at almost 80% accuracy.
(Presenter: Rohan)

10 RESULTS: ROC/AUC
[ROC curves: Random Forest, SVM (baseline), AdaBoost, Neural Network]
(Presenter: Rohan)
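A sketch of producing one of these ROC curves with scikit-learn and matplotlib, assuming a fitted classifier model (e.g. the tuned Random Forest) and a held-out test set X_test, y_test.

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

scores = model.predict_proba(X_test)[:, 1]  # predicted probability of "hot"
fpr, tpr, _ = roc_curve(y_test, scores)
plt.plot(fpr, tpr, label=f"Random Forest (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], "k--", label="chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()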

11 TOOLS
- Spark to process the 270 GB dataset into a 1 GB CSV; also used for ML models (with sparkit-learn).
- h5py to read the dataset (stored in HDF5 binary format).
- Pandas for efficient data handling, cleaning, and imputation.
- NumPy and SciPy for data exploration and analysis.
- Scikit-learn for machine learning models.
- Sparkit-learn for machine learning models on EC2.
- Matplotlib for data visualization.
(Presenter: Alper)

12 LEARNINGS
Dataset learnings:
- Able to predict song popularity with ~80% accuracy; the Random Forest model performed best.
- Feature importance (from the information-gain metric of the RF model, sketched below): artist familiarity, artist popularity, loudness, tempo, and the keywords pop, jazz, classic, guitar, hop, metal, new, power, world.
Data science learnings:
- Feature engineering matters: BoW on artist_terms and TF-IDF on song_title significantly improved results.
- Accuracy isn't enough: we also need to look at ROC.
Interesting question: how do you break into music?
(Presenter: Michael)
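A sketch of ranking features by the Random Forest's impurity-based importances; model is the fitted RandomForestClassifier and feature_names the list of engineered column names (both assumed).

import pandas as pd

importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))
# e.g. artist_familiarity, loudness, tempo, ...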

