Machine Learning practical with Kaggle
Alan Chalk
Disclaimer The views expressed in this presentation are those of the author, Alan Chalk, and not necessarily of the Staple Inn Actuarial Society
Over the next 45 minutes: loss functions, greedy algorithms, performance measurement, feature engineering, generalisation error, bias and variance, penalisation, training and validation curves, hyperparameter tuning, decision trees, random forests, gradient boosting, basis functions and adaptive basis functions, with code in R and Python.
RentHop (Two Sigma Connect)
Example RentHop posting
What is our task?
Evaluation (loss function)
Example (i) | True class (j) | Probability high | Probability medium | Probability low
1 | medium | 0.1 | 0.6 | 0.3
2 | high | 0.0 | 0.5 |
Why not use accuracy? Is it possible to do machine learning without performance measurement?
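The Kaggle metric here is multi-class log loss. A minimal R sketch of the calculation (the function name mlogloss and the clipping constant are our own choices, not taken from the workshop code):

# multi-class log loss: average of -log(predicted probability of the true class)
# probs: matrix of predicted probabilities (rows = examples, columns named by class)
# truth: vector of true class labels matching colnames(probs)
mlogloss <- function(probs, truth, eps = 1e-15) {
  probs  <- pmin(pmax(probs, eps), 1 - eps)   # clip so log(0) never occurs
  p_true <- probs[cbind(seq_along(truth), match(truth, colnames(probs)))]
  -mean(log(p_true))
}
# example row 1 above: true class "medium", predicted probability 0.6
mlogloss(rbind(c(high = 0.1, medium = 0.6, low = 0.3)), "medium")   # -log(0.6), about 0.51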
Process: create pipeline, baseline guess, improve guess.
Create pipeline and baseline
Read raw data, clean data, create guess, submit on Kaggle: that is the baseline guess. Now go through code 00a, 00b, 01a, 02a, 04__ and 04a.
First R code: 00a_Packages.R, 00b_Working Directories.R, 01a_ReadRawData.R, 01b_CleanData.R, 04__LoadAndPrepareData.R, 04a_BaselinePredictions.R
Bottom of the leaderboard
Any comments from people? Someone should notice that I got 26.5, not the expected score (the probability columns were submitted in the wrong order!)
Decision trees
Features: bathrooms, bedrooms, latitude, longitude, price, listing_id
Given the above list of features, and starting with all the rentals, which subset would you choose to separate out those of high and low interest? Why did you choose this? You must have had some "inherent loss function". If we get the computer to exhaustively search all possible splits for the best improvement to the loss, which loss function should we choose?
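Two standard split criteria (standard textbook definitions, not taken from the slides), for a node with class proportions p_k:
Gini impurity: \sum_k p_k (1 - p_k)
Entropy (used for information gain): -\sum_k p_k \log p_k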
Some decision tree vocab
CART (rpart), C5.0 etc.; split rule (loss function); NP-hard; greedy; over-fitting; complexity parameter
R code: formulas In an R formula, ~ means “is allowed to depend on”. Our first formula is: interest_level ~ bathrooms + bedrooms + latitude + longitude + price + listing_id (In our code, you will see that this formula is saved in a variable called “fmla_”)
R code: rpart
library(rpart)
rpart_1 <- rpart(fmla_, data = dt_all[idx_train1, ],
                 method = "class",   # classification tree
                 cp = 1e-8)          # tiny cp: grow a large tree now, prune later
R code 04b01_rpart.R. Go through the R code up to creating a first simple tree and interpreting it.
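A hedged sketch of one way to get a small, interpretable tree out of the deep rpart_1 fit above (the cp value and the use of rpart.plot are our assumptions, not necessarily what 04b01_rpart.R does):

library(rpart.plot)
rpart_small <- prune(rpart_1, cp = 0.01)   # prune the deep tree back to a few splits
rpart.plot(rpart_small)                    # plot splits and per-node class probabilities
printcp(rpart_1)                           # cross-validated error at each complexity parameter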
A first tree for renthop
Ask someone to describe the node with the highest interest (2 or more bedrooms for less than $1,829 per month).
How can we do better? Better techniques
More features (feature engineering). Anything else? Ask the audience: we are looking for two things, better techniques and more features.
Feature engineering? Based only on the data and files provided:
bathrooms, bedrooms, building_id, created, description, display_address, features, latitude, listing_id, longitude, manager_id, photos, price, street_address, interest_level Note: You also have loads and loads of photos. “description” is free format. “features” is a list of words.
Feature engineering? Simple features: price per bedroom, bathroom/bedroom ratio, hour or day of week created. Simplifications of complex features: number of photos, number of words in the description. Presence or absence of each listed feature: e.g. laundry, yes or no. A "good value" rental indicator.
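A minimal data.table sketch of a few of these features, assuming dt_all holds the listings with photos and features as list columns; all the new column names here are our own:

library(data.table)
dt_all[, price_per_bed  := price / pmax(bedrooms, 1)]        # guard against 0-bedroom studios
dt_all[, bath_bed_ratio := bathrooms / pmax(bedrooms, 1)]
dt_all[, created_hour   := hour(as.POSIXct(created))]
dt_all[, n_photos       := lengths(photos)]                  # photos: list of URLs per listing
dt_all[, n_desc_words   := lengths(strsplit(description, "\\s+"))]
dt_all[, has_laundry    := sapply(features, function(f) any(grepl("laundry", f, ignore.case = TRUE)))]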
High cardinality features?
manager_id, building_id. Simplifications: turn the "size" of the manager or building into a numeric feature. What else? Talk about the approach to credibility: ask the audience how many examples we need for one manager or building before we rely on that experience rather than the overall model. Statistics: 1,000? And what is the shape of the curve? ML says: use the data to find out.
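A hedged sketch of the "size" simplification as a simple count encoding in data.table (mgr_size and bld_size are our own names):

dt_all[, mgr_size := .N, by = manager_id]    # number of listings handled by each manager
dt_all[, bld_size := .N, by = building_id]   # number of listings in each building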
Leakage features? A key aspect of winning machine learning competitions. Where might there be leakage in the data we have been given? Paper: Kaufman, Rosset and Perlich, "Leakage in Data Mining: Formulation, Detection, and Avoidance".
Now what? We have loads of features: good.
But there is every chance that our decision tree will pick up random noise in the training data (called "variance"). How can we control for this? Cost complexity pruning.
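A minimal sketch of cost complexity pruning with rpart, assuming the deep rpart_1 fit from earlier; choosing the cp with the lowest cross-validated error is one common rule, not necessarily the one used in the workshop code:

cp_tab  <- rpart_1$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]   # cp with lowest cross-validated error
rpart_pruned <- prune(rpart_1, cp = best_cp)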
R code: 02b_FeatureCreation_1/2/3.R, 04b_01_rpart.R, 04b_02_VariableImportance.R
Training and validation curves
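A hedged sketch of how training and validation curves could be traced for the tree, reusing mlogloss from above; idx_val1, y_val and the cp grid are our own assumptions, not the workshop's objects:

cp_grid <- 10^seq(-2, -6, by = -0.5)
curves  <- sapply(cp_grid, function(cp) {
  fit <- prune(rpart_1, cp = cp)   # prune the deep tree at each complexity value
  c(train = mlogloss(predict(fit, dt_all[idx_train1, ], type = "prob"), y_train),
    val   = mlogloss(predict(fit, dt_all[idx_val1,   ], type = "prob"), y_val))
})
matplot(log10(cp_grid), t(curves), type = "l", xlab = "log10(cp)", ylab = "log loss")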
Variable importance
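For the tree, rpart stores an importance measure directly; a one-line sketch (the element name is part of the rpart package, the interpretation comment is ours):

rpart_1$variable.importance   # total improvement in the split criterion attributed to each variable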
Random Forest. Introduce randomness. Why?
Bootstrapping and then aggregating the results ("bagging"). How else can we create randomness? Sample the features available to each split. OOB error. What are our hyper-parameters? Number of trees? nodesize? mtry?
R code: random forest
library(randomForest)
rf_1 <- randomForest(x = dt_train, y = as.factor(y_train),
                     ntree = 300, nodesize = 1, mtry = 6, keep.forest = TRUE)
R code 04c_RandomForest.R
Gradient boosting. Add lots of "weak learners".
Create new weak learners by focusing on examples which are incorrectly classified. Add all the weak learners using weights which are higher for the better weak learners. The weak learners are "adaptive basis functions".
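In the standard additive-model notation (a textbook formulation, not taken from the slides), the boosted model is a weighted sum of adaptive basis functions:
f(x) = \sum_{m=1}^{M} \beta_m \, b(x; \gamma_m)
where each b(x; \gamma_m) is a weak learner (here a small tree with parameters \gamma_m) and the weights \beta_m are fitted greedily, one stage at a time.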
Hyperparameters: learning rate, depth of trees, min child weight, data subsampling, column subsampling. Tuning: grid search, random search, hyperopt.
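A hedged sketch of how these hyperparameters map onto xgboost parameter names, written for the R xgboost interface (the Python notebook uses the same parameter strings); the values are illustrative, not the ones tuned in the workshop:

param <- list(objective        = "multi:softprob",  # multi-class, output per-class probabilities
              num_class        = 3,
              eta              = 0.1,                # learning rate
              max_depth        = 6,                  # depth of trees
              min_child_weight = 1,
              subsample        = 0.8,                # data subsampling
              colsample_bytree = 0.8)                # column subsampling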
Python code
xgb.train(params=param, dtrain=xg_train,
          num_boost_round=num_rounds, evals=watchlist,
          early_stopping_rounds=20, verbose_eval=False)
Gradient boosting (machines)
04d_GradientBoosting_presentation.ipynb