Using decision trees and their ensembles for analysis of NIR spectroscopic data
WSC-11, Saint Petersburg, 2018
In the light of the morning session on superresolution
Outline
S. Kucheryavskiy, WSC-11, Saint Petersburg 2018

- Why decision trees?
- What decision trees are?
- Decision trees ensembles
- Cases: Tecator, Olives
- Conclusions
Why decision trees? Why not?
But why decision trees? Kaggle CEO and founder Anthony Goldbloom:
"…in the history of Kaggle competitions, there are only two Machine Learning approaches that win competitions: Handcrafted and Neural Networks."
"…It used to be random forest that was the big winner, but over the last six months a new algorithm called XGBoost has cropped up, and it's winning practically every competition in the structured data category."
Why NIR spectroscopic data? When can linear regression be better than decision tree methods?
- when the relationship between X and y is fully linear
- when there is a very large number of features with a low S/N ratio
- when covariate shift is likely
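The covariate-shift point can be illustrated with a minimal scikit-learn sketch on synthetic data (not part of the original slides): a tree predicts a constant inside each leaf, so it cannot extrapolate beyond the training range, while a linear model can.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic, perfectly linear data (illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X.ravel() + 1.0

lin = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=5).fit(X, y)

# Covariate shift: predict outside the training range [0, 10]
X_new = np.array([[20.0]])
print(lin.predict(X_new))   # extrapolates the linear trend (~41)
print(tree.predict(X_new))  # stuck at the rightmost leaf mean (<= 21)
```

The tree's answer for x = 20 is the mean of its rightmost training leaf, far below the true value, which is exactly why a shifted test distribution hurts tree models on linear spectroscopic relationships.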
What decision trees are?

Drinks beer?
  yes / no
    Knows statistics? → Not chemometrician / Chemometrician
    Steals ideas from statisticians? → Chemometrician / Not chemometrician
Decision trees for numeric variables
Decision trees for numeric variables: where are the other variables?
- At every split the single best variable is used
- The number of splits (tree depth) is limited
- The efficiency of a split is the reduction in misclassification error
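The splitting mechanics above can be sketched with scikit-learn on toy data (illustrative, not the slide's example): the tree picks one variable and one threshold per split, and `max_depth` caps the number of consecutive splits.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Two numeric variables; the class depends only on x1
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)

# At every split the tree picks the best variable and threshold;
# max_depth limits how many splits can be stacked
tree = DecisionTreeClassifier(max_depth=2, criterion="gini").fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))
```

Since x1 alone separates the classes, the root split is chosen on x1 and the printed tree shows the threshold near zero.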
Decision trees for numeric variables: how many splits?
- Limit the minimum number of objects in each bucket
- Limit the maximum tree size (depth / number of splits)
- Grow a big tree and prune all inefficient splits
Decision trees for numeric variables: how many splits? (cont.)
[Figure: misclassification error per split — first split −50% (error 50%), second −44% (error 6%), third −2% (error 4%)]
- Use cross-validation to calculate the errors
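Choosing the tree size by cross-validation can be sketched as follows (a minimal scikit-learn example on synthetic data; the grid values are illustrative): grow a big tree and let cross-validation pick the pruning strength (`ccp_alpha`, cost-complexity pruning) and minimum leaf size instead of guessing them.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# Cross-validate over pruning strength and minimum bucket size
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": [0.0, 1.0, 10.0, 100.0],
                "min_samples_leaf": [1, 5, 10]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```

Larger `ccp_alpha` prunes more aggressively, which is the automated version of "prune all inefficient splits".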
Decision trees: variable importance
- Is calculated for each variable individually
- Takes into account the role of a variable in different splits
- Is accumulated across all splits and normalized
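The importance measure described above is exposed directly by scikit-learn (a small sketch on synthetic data, not the slide's dataset): each variable's impurity reduction is summed over all splits it appears in and the result is normalized to sum to one.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
# Class depends strongly on x0, weakly on x1; x2 is pure noise
y = (X[:, 0] + 0.3 * X[:, 1] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
# Accumulated over every split a variable appears in, normalized to sum to 1
print(tree.feature_importances_)
```

On this data x0 dominates the importances and the noise variable x2 gets the smallest share.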
Decision trees regression
- The response variable is split into several bins
- The variance within each node is minimized
Decision trees ensembles
Ensemble learning: combine several models together
- A group of weak learners can perform better together
- Decreases variance, makes predictions more stable and reliable
Decision trees ensembles

Bagging (→ Random forest)
- Create N random subsets (sampling with replacement)
- Train a model on every subset (in parallel)
- Use a simple average for prediction

Boosting (→ Gradient boosting)
- Train a model on a random subset
- Build N improved models using new subsets (sequentially)
- Use a weighted average for prediction
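Both ensemble strategies are one-liners in scikit-learn (a sketch on synthetic data, not the slide's cases): `RandomForestRegressor` implements bagging of trees, `GradientBoostingRegressor` implements sequential boosting.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

models = {
    # Bagging: many trees on bootstrap subsets, fitted independently, averaged
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    # Boosting: trees fitted sequentially, each correcting the previous ones
    "gradient boosting": GradientBoostingRegressor(n_estimators=100, random_state=0),
}
scores = {name: cross_val_score(m, X, y, cv=3).mean() for name, m in models.items()}
print(scores)
```

The parallel/sequential distinction matters in practice: forests are trivially parallelizable and robust to their defaults, while boosting usually needs tuning of the learning rate and tree count.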
Tecator: prediction of fat content in chopped meat samples by NIR spectra
http://lib.stat.cmu.edu/datasets/tecator
- 100 predictors (NIR spectra from a Tecator Infratec Food and Feed Analyzer, 850–1050 nm)
- 215 measurements (172 for calibration, 43 for test)
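The setup can be sketched as follows. Random numbers stand in for the real spectra (available at the URL above), so only the shapes and the 172/43 split match the slides; the response construction is entirely artificial.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in for the Tecator data: 215 samples x 100 absorbance channels
rng = np.random.default_rng(0)
X = rng.normal(size=(215, 100))
y = 5.0 * X[:, 40] + rng.normal(scale=0.5, size=215)  # artificial "fat content"

# Same 172/43 calibration/test split as in the slides
X_cal, X_test, y_cal, y_test = train_test_split(X, y, train_size=172, random_state=0)

tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0).fit(X_cal, y_cal)
print(tree.score(X_test, y_test))
```

With the real spectra loaded in place of `X` and `y`, the same lines reproduce the single-tree calibration workflow shown on the following slides.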
Single tree — predictions
Single tree — the tree
Single tree — variable importance
Single tree — variable selection
Random forest — predictions
Random forests — importance of variables
Random forests — variable selection
Olives
Single tree — the tree and splits
Single tree — classification
Random forest — classification
Variable importance
Conclusions
"The bottom line is: you can spend 3 hours playing with the data, generating features and interaction variables, and get a 77% r-squared; and I can `from sklearn.ensemble import RandomForestRegressor` and in 3 minutes get an 82% r-squared."
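The "3-minute" workflow from the quote, spelled out on synthetic data (the dataset and scores here are illustrative, not the 77%/82% figures from the quote):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Default random forest, no feature engineering
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(model.score(X_te, y_te))  # r-squared on held-out data
```

The point of the quote stands: sensible out-of-the-box performance with no handcrafted features, which is the main practical argument for trying tree ensembles on NIR data.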
IASIM-2016
IASIM-2018, June 17–20, 2018, Seattle, WA, USA
Deadlines: March 12 (student scholarship), April 5 (abstract)
www.iasim18.iasim.net