
1 COMPARISON OF GLMs AND MACHINE LEARNING ALGORITHMS. 2016 CASE SPRING MEETING. Scott Sobel, FCAS, MAAA, MSPA. March 22, 2016

2 What are the differences between statistics and machine learning? [Figure: side-by-side diagrams mapping input variables X to a target Y under Statistics and under Machine Learning]

3 Topics
1. Background
2. Analytical Process
3. Methodologies
4. Implementation

4 Section 1: Background

5 Linear Regression
The earliest form of regression was the method of least squares, published by Legendre (1805) and used to determine the orbits of comets and planets around the Sun.
Its form was e(β) = Y − Xβ, where the β's were estimated by minimizing the sum of squared residuals Σe².
Gauss (1809) introduced using the Normal distribution for the error terms.
Galton coined the term "regression" (1800s) to describe how the heights of children of tall parents tend to "regress down" while the heights of children of short parents tend to "regress up" ("regression towards the mean").
Current regression model form: Y = βX + e = β₀ + β₁X₁ + β₂X₂ + … + e
Source: https://en.wikipedia.org/wiki/Regression_analysis
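To make the least-squares idea concrete, here is a minimal sketch in Python with NumPy (an assumed tool choice, not part of the deck); the data are simulated for illustration:

import numpy as np

# Simulated data for illustration: Y = 2 + 3*X1 + noise (made-up values)
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x1 + rng.normal(0, 1, size=100)

# Design matrix X with an intercept column, matching Y = β0 + β1*X1 + e
X = np.column_stack([np.ones_like(x1), x1])

# The betas minimize the sum of squared residuals Σe²
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
print(beta)              # roughly [2, 3]
print((e ** 2).sum())    # the minimized residual sum of squares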

6 Linear Regression [Figure: fitted regression line through a scatter of data points]

7 Machine Learning
Machine learning uses an algorithm to learn a function that maps input variables to the target output from training data: Y = f(X)
– The specific model form is unknown
Statistical Learning = Parametric Models
– No matter how much data you throw at a parametric model, it won't change its mind about how many parameters it needs
– Suited to simpler problems
Machine Learning = Nonparametric Models
– Feature selection is less important
– Requires fewer assumptions and can result in superior performance
– May require more training data
– Risk of overfitting; often more difficult to explain
Source: http://machinelearningmastery.com/parametric-and-nonparametric-machine-learning-algorithms/
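A small illustration of the parametric/nonparametric contrast, sketched in Python with scikit-learn (assumed tooling, simulated data): the linear model keeps its fixed form no matter the data, while k-nearest neighbors lets the data drive the shape of the fit.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Simulated nonlinear signal: a sine curve plus noise
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 6, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.2, 200)

# Parametric: the model form (a line) is fixed in advance
linear = LinearRegression().fit(X, y)
# Nonparametric: the prediction is built from nearby training points
knn = KNeighborsRegressor(n_neighbors=10).fit(X, y)

print(linear.score(X, y))  # low R² because a line cannot follow the sine
print(knn.score(X, y))     # higher R², with the attendant overfitting risk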

8 Advancement of Machine Learning
1952: Arthur Samuel wrote the first computer learning program, which played checkers and improved as it played.
1957: Frank Rosenblatt designed the first neural network.
1967: The nearest neighbor algorithm allowed computers to begin using basic pattern recognition.
1979: Students at Stanford University invented the "Stanford Cart," which navigated obstacles.
1981: Gerald Dejong introduced Explanation Based Learning (EBL), in which a computer creates rules by discarding unimportant data.
1985: Terry Sejnowski invented NetTalk, which learns to pronounce words the same way a baby does.
1990s: Scientists began creating algorithms to learn from large amounts of data.
1997: IBM's Deep Blue beat the world champion at chess.
2006: Geoffrey Hinton coined "deep learning" for algorithms that let computers distinguish objects and text in images and videos.
2010: Microsoft Kinect tracked 20 human features at 30 times per second, allowing interaction with the computer via movements and gestures.
2011: Google Brain was developed; its deep neural network can learn to discover and categorize objects much the way a cat does.
2011: IBM's Watson beat its human competitors at Jeopardy.
2012: Google's X Lab developed an algorithm to browse YouTube videos and identify those containing cats.
2014: Facebook developed DeepFace, which is able to recognize people in photos to the same level as humans can.
2015: Amazon launched its own machine learning platform.
2015: Microsoft created the Distributed Machine Learning Toolkit, enabling distribution of problems across multiple computers.
2016: Google's artificial intelligence algorithm beat Lee Sedol at the Chinese board game Go. (Update: Lee wins a game!)
Source: http://www.datasciencecentral.com/profiles/blogs/a-short-history-of-machine-learning

9 Advancement of Machine Learning
Seeing, crawling, walking, learning, speaking, playing, driving, challenging humans… (the same timeline as the previous slide)

10 Why is machine learning popular right now?
Four reasons:
1. The field has matured, both in terms of identity and in terms of methods and tools
2. There is an abundance of data available
3. There is an abundance of computation to run methods
4. There have been impressive results, increasing acceptance, respect, and competition
Stanford's Coursera machine learning course had more than 100,000 students expressing interest in the first year. That is crazy!
Resources + Ingredients + Tools + Desire = Popularity
Based on: http://machinelearningmastery.com/machine-learning-is-popular/?__s=yq1qzcnf67sfiuzmnvjf

11 Does machine learning leave statistical theory behind?
In his paper "Statistical Modeling: The Two Cultures," Leo Breiman discusses two cultures that contrast traditional statistical theory and machine learning.
The traditional statistical theory, or data modeling, culture:
– Assumes the data are generated by a stochastic process
– Breiman argues this "has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems"
The machine learning, or algorithmic modeling, culture:
– Has developed rapidly, largely outside theoretical statistics
– Uses algorithmic models, treating the data mechanism as unknown
"If our goal is to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools."
Source: http://cdn2.hubspot.net/hub/160602/file-18146220-pdf/docs/statistical_modeling_the_two_cultures.pdf

12 Does machine learning leave statistical theory behind? ("Statistical Modeling: The Two Cultures")
Data modeling:
– Response = f(predictors, parameters, random noise)
– Parameters are estimated; random noise is what's left over
– Model evaluation is based on assumed fit to a statistical distribution
– The question asked: how well can we fit the data using a given model?
Algorithmic modeling:
– Finds a function/algorithm f(x) to predict y
– Model evaluation is based on predictive accuracy
– The question asked: how well can we fit a model to the given data?
Source: http://cdn2.hubspot.net/hub/160602/file-18146220-pdf/docs/statistical_modeling_the_two_cultures.pdf

13 Does machine learning leave statistical theory behind?
– Origins: statistical learning arose as a subfield of statistics; machine learning arose as a subfield of artificial intelligence
– Emphasis: statistical learning places greater emphasis on model structure, interpretability, precision, and uncertainty; machine learning places greater emphasis on model flexibility, large-scale applications, and prediction accuracy
The distinction has become more and more blurred, with increasing cross-fertilization.
Source: "An Introduction to Statistical Learning with Applications in R", James et al., 2013

14 Section 2: Analytical Process

15 Common analytical process
In all algorithms, whether machine learning or statistical learning, you need to:
– Develop expectations
– Match expectations with data
– Collect the data
– Establish the specific modeling question
– Perform EDA and adjust the data
– Build model(s): make assumptions, define the "best fit" function, recognize concerns, evaluate, produce predictions
– Interpret model(s)
– Communicate results
Source of image: https://leanpub.com/artofdatascience

16 Differences
Input data:
– Statistical learning wants "tidy data"; machine learning generally wants "tidy data," but some algorithms are designed for unstructured data
– Statistical learning does not handle missing values well; machine learning can easily handle missing values (see the sketch below)
Data adjustments:
– Statistical learning assumes linear relationships between Y and the input X's; machine learning looks for patterns and does not assume linearity
– Statistical learning is sensitive to multicollinearity; machine learning is generally more forgiving about multicollinearity
Modeling assumptions:
– Statistical learning assumes the residuals follow a distribution in the exponential family; machine learning does not assume the data follow a certain distribution
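As a hedged illustration of the missing-value row above, in Python with scikit-learn (assumed tooling, made-up data): a linear model requires imputation before fitting, while some tree-based learners, such as scikit-learn's histogram-based gradient boosting, route missing values through the splits directly.

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import HistGradientBoostingRegressor

# Tiny made-up dataset with missing predictor values
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# Statistical-learning side: impute first, or the fit raises an error
X_filled = SimpleImputer(strategy="mean").fit_transform(X)
LinearRegression().fit(X_filled, y)

# Some machine-learning implementations accept NaN directly and route
# missing values through the tree splits, with no imputation step
HistGradientBoostingRegressor().fit(X, y)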

17 Differences
Minimization task:
– Statistical learning minimizes squared error; machine learning minimizes a defined "loss" function
Concerns:
– Statistical learning: feature selection is very important; underfitting and overfitting; multicollinearity; external confounding effects; patterns in the residuals; missed nonlinear relationships
– Machine learning: feature selection is not as crucial; overfitting! (bias vs. variance); general lack of transparency
Model evaluation:
– Statistical learning: statistically based goodness-of-fit measures (adjusted R², F-test, p-values, AIC/BIC, confidence intervals) and residual analysis; can also use cross-validation and/or hold-out data testing
– Machine learning: uses cross-validation and/or hold-out data testing as measures of convergence (see the sketch below)
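A minimal sketch of the shared evaluation tool in the table, cross-validation, in Python with scikit-learn (assumed tooling) on a built-in example dataset:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# 5-fold cross-validation: score on data the model did not see,
# rather than on the in-sample fit
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.mean(), scores.std())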

18 Some concepts
Training error
– A measure of how well the model performs on the input data
– Not the most important thing
Validation error
– A measure of how well the model performs on unseen data
– More important than training error; it indicates generalization performance
Signal
– Data representative of the true underlying process
Noise
– Data not representative of the true underlying process
Overfitting
– The degree to which the model is influenced by noise
– Results from an overly complex model
– Undesirable for good generalization/prediction
[Figure: two fitted curves, labeled "Just right" and "Overfitting"]
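The training-versus-validation distinction can be seen in a few lines of Python with scikit-learn (assumed tooling, simulated data): as model complexity grows, training error keeps falling while validation error eventually rises, which is the overfitting signature described above.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Simulated quadratic signal plus noise
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, 80).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 1, 80)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

for degree in (1, 2, 15):   # underfit, "just right", overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),   # training error
          mean_squared_error(y_va, model.predict(X_va)))   # validation error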

19 Section 3: Methodologies

20 Methodology types
Statistical Learning
– Regression
Machine Learning
– Clustering
  - Decision Trees
  - k-Means
– Instance-Based Algorithms
  - k-Nearest Neighbors
  - Bayesian Algorithms
  - Support Vector Machines (SVM)
– Artificial Neural Networks
  - Self-Organizing Maps (SOMs) / Kohonen maps
– Ensembles
  - Random Forests

21 Regression [Figure: scatterplot of Y against X with a fitted line]
Source of images: "An Introduction to Statistical Learning with Applications in R", James et al., 2013

22 Regression
Models the independent linear relationship between the input variables and the known target:
– A general model structure is assumed
– Can be used for classification
Variations:
– Ordinary least squares (OLS)
– Generalized linear models (GLMs; a fitting sketch follows below)
– Logistic regression
– Multivariate adaptive regression splines (MARS)
– Locally estimated scatterplot smoothing (LOESS)
Model evaluation:
– Residual analyses
– Many goodness-of-fit tests based on the assumed statistical distribution: RSS, adjusted R², F-test, AIC, BIC, p-values, standard errors/confidence intervals
– Should also use cross-validation and/or hold-out test data
Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
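A hedged sketch of a GLM fit in Python with statsmodels (assumed tooling; the data are simulated and loosely styled after a severity model, not taken from the deck):

import numpy as np
import statsmodels.api as sm

# Simulated severity-style data: positive target, two rating variables
rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(500, 2))
mu = np.exp(1.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1])
y = rng.gamma(shape=2.0, scale=mu / 2.0)   # mean of each draw is mu

# Gamma GLM with a log link; the fit reports the distribution-based
# diagnostics listed above (standard errors, p-values, AIC, etc.)
model = sm.GLM(y, sm.add_constant(X),
               family=sm.families.Gamma(link=sm.families.links.Log()))
result = model.fit()
print(result.summary())
print(result.aic)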

23 Regression
Concerns:
– Unknown data points
– Sensitive to outliers
– Nonlinearity between the X's and Y
– Correlations within the X variables
– Needs sufficient spread across each X's values
– Extrapolation beyond the data
– Heteroscedasticity / patterns in the residuals / errors that don't follow the assumed distribution
Enhancements:
– Feature selection: forward, backward, and stepwise regression; LASSO and ridge regression (see the sketch below)
– Variable interactions: CHAID, machine learning algorithms
– Categorical value groupings
Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
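The LASSO enhancement mentioned above, sketched in Python with scikit-learn (assumed tooling) on a built-in dataset:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV

X, y = load_diabetes(return_X_y=True)

# The L1 penalty shrinks weak coefficients all the way to zero, so the
# fitted model performs feature selection; cross-validation picks the penalty
lasso = LassoCV(cv=5).fit(X, y)
print(lasso.alpha_)   # selected penalty strength
print(lasso.coef_)    # zero entries mark dropped features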

24 Clustering: Decision Trees [Figure: a fitted tree and the corresponding rectangular partition of the predictor space into regions R1, R2, R3]
Source: "An Introduction to Statistical Learning with Applications in R", James et al., 2013

25 Decision Trees vs. Linear Models: which wins depends on the underlying data structure
1. True linear boundary: the linear model fits perfectly
2. True nonlinear boundary: the linear model fits poorly
3. True linear boundary: a flexible decision tree can approximate it
4. True nonlinear boundary: a simple decision tree fits perfectly
Source: "An Introduction to Statistical Learning with Applications in R", James et al., 2013

26 Clustering: Decision Trees
Stratify or segment the predictor space into simple axis-parallel regions:
– The specific model form is data-driven
– Can be used for classification or regression
– Very fast, but typically not competitive with better learning approaches, especially ensembles
Variations:
– Classification and Regression Trees (CART), Iterative Dichotomiser 3 (ID3), C4.5 and C5.0
– Chi-squared Automatic Interaction Detection (CHAID)
– Conditional decision trees
Model evaluation, with goodness-of-fit tests based on empirical performance:
– Confusion matrix for accuracy/misclassification
– Between- vs. within-group variance
– Cross-validation and/or hold-out test data (see the sketch below)
Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
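A minimal decision-tree sketch in Python with scikit-learn (assumed tooling), showing the data-driven if/then structure and an empirical evaluation by cross-validation:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# A CART-style tree; capping the depth is a simple guard against overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))                         # the learned if/then splits
print(cross_val_score(tree, X, y, cv=5).mean())  # empirical evaluation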

27 Clustering: Decision Trees
Concerns:
– Overfitting: use pruning
– Sensitive to data structure, though not as much to outliers
– Local minima and maxima
– Small variations in experiments may lead to seemingly dramatically different decision trees; test by similarity of predictions
Enhancements:
– Variable interactions can create curved boundaries in addition to axis-parallel ones
– Categorical value groupings: how to handle low credibility?
– Ensembling
[Figure: regions R1 and R2 separated by the curved boundary between Y < X₁X₂ and Y ≥ X₁X₂]
Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/

28 Clustering: k-Means
General model form: minimize the within-cluster variance, i.e. choose cluster assignments to minimize Σₖ Σ_{xᵢ∈Cₖ} ‖xᵢ − μₖ‖², where μₖ is the centroid of cluster Cₖ
Loss function: a variance or distance measure
Source: "The Elements of Statistical Learning", Hastie et al., 2nd ed., 2013

29 Clustering: k-Means
Finds k clusters based on a measure of "closeness":
– Regions are not necessarily axis-parallel
– The specific model form is data-driven
– Diagnostics and modeler preference help determine k (see the sketch below)
– Typically can only use numeric predictors
Variations:
– k-Medians
– k-Modes for categorical predictors
– Hierarchical clustering
Model evaluation, with goodness-of-fit tests based on empirical performance:
– Confusion matrix
– Between- vs. within-cluster variance
– Cross-validation and/or hold-out test data
Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
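A k-means sketch in Python with scikit-learn (assumed tooling, simulated data), illustrating how diagnostics can guide the choice of k:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Simulated numeric data with four true clusters
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Within-cluster variance (inertia) always falls as k grows, so a second
# diagnostic such as the silhouette score helps the modeler pick k
for k in (2, 3, 4, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_, silhouette_score(X, km.labels_))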

30 Clustering: k-Means
Concerns:
– Overfitting
– Sensitive to data structure and outliers
– Sensitive to lopsided cluster variances, such as with insurance data, especially fraud
– Local minima and maxima
Enhancements:
– Feature selection
– Variable interactions can create additional feature-space dimensions
– Ensembling
Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/

31 Instance-Based Algorithm: k-Nearest Neighbors
General model form: predict from the k nearest training points, i.e. Ŷ(x) = (1/k) Σ_{xᵢ∈Nₖ(x)} yᵢ for regression, or a majority vote for classification
Loss function: % misclassified
[Figure: classification decision boundaries in the (X₁, X₂) plane]
Source: "An Introduction to Statistical Learning with Applications in R", James et al., 2013

32 Instance-Based Algorithm: k-Nearest Neighbors
Finds a separation boundary based on the k neighboring data points:
– Certainly not axis-parallel regions
– The specific model form is data-driven
– Diagnostics and modeler preference help determine k (see the sketch below)
– Can be used for classification or regression
Model evaluation, with goodness-of-fit tests based on empirical performance:
– Confusion matrix
– RMSE
– Cross-validation and/or hold-out test data
Concerns:
– Sensitive to data structure; can overfit
– Local minima and maxima
– Sensitive to lopsided cluster variances, such as with insurance data, especially fraud
Enhancements:
– Feature selection
– Extended k-NN (ENN)
– Weighted distances
Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
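A k-nearest-neighbors sketch in Python with scikit-learn (assumed tooling). Because the method is distance-based, standardizing the predictors first is the usual precaution:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Standardize first (distances are scale-sensitive), then sweep k and let
# cross-validated accuracy guide the choice
for k in (1, 5, 15):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    print(k, cross_val_score(model, X, y, cv=5).mean())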

33 Ensembling
Produces a more robust model by combining many base learners:
– Think of diversifying risk in finance
– Prediction quality benefits from averaging many models
– Complexity increases, but variance decreases faster as wayward errors cancel out
– Base learners can be simple/weak; decorrelation is key to reducing the variance
Variations:
– Boosting
– Bootstrapped Aggregation (Bagging)
– AdaBoost
– Stacked Generalization (blending)
– Gradient Boosting Machines (GBM)
– Gradient Boosted Regression Trees (GBRT)
– Random Forests
– Unlimited ensembling
Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/

34 Ensembling: Random Forests
Bootstrapping
– A way of creating additional training data to build a more stable model
– Build a dataset of the same size as your training data by repeatedly sampling with replacement from your original training data
– Some data points may be included more than once while others will be left out
Random Forests algorithm (see the sketch below):
1. Build many deep decision trees on bootstrapped training datasets
2. At each split, only consider a random subset of predictors (≈ √p is typical)
3. Average the predictions from all the individual decision trees
– Random forests generally perform very well
– Very robust
– A go-to favorite among many data scientists
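The three-step algorithm above, sketched in Python with scikit-learn (assumed tooling) on a built-in dataset:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Many trees on bootstrapped samples; max_features="sqrt" is the random
# subset of roughly √p predictors considered at each split
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())   # averaged-tree predictions

rf.fit(X, y)
print(np.argsort(rf.feature_importances_)[::-1][:5])  # top feature indices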

35 Ensembling: Random Forests
Concerns:
– Correlation of the individual trees
– Can overfit (the literature is divided on this)
– Not as sensitive to data structure, outliers, or local minima and maxima
– Transparency, communicability, feature contributions
Enhancements:
– Depth of the decision trees
– Number of training iterations
– Variable interactions can create curved boundaries in addition to axis-parallel ones
– Categorical value groupings: how to handle low credibility?
– Confidence intervals for input variables
Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/

36 Section 4: Implementation

37 Implementation
GLMs
– Individual tables of factors, one for each input variable, for example:
  Variable X   Factor
  A            1.200
  B            1.000
  …            …
  C            0.800
– Easy to interpret
– Simply replace the current factors in the same structure of rate pages
Decision Trees
– A set of if/then conditions that define groupings/tiers/segments:
  if (condition) then group 1
  else if (condition) then group 2
  …
  else group n
  end if
– Easy to interpret
– A new step in the rating algorithm
k-Means
– Based on distance to the cluster centroids
– Simply assign a new data point to the closest cluster centroid
– Since it is unsupervised, it is more difficult to accept because human intuition has not formed
– Could be used for target marketing to customer lifestyles
– Can enhance variable groupings
A sketch of the GLM and decision-tree rating mechanics follows below.
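A plain-Python sketch of those two rating mechanics. Every factor value, split point, and name here is hypothetical and for illustration only; in practice they would come from a fitted GLM or tree.

# Hypothetical rating step. The factor values, split points, and names are
# made up for illustration; they come from a fitted GLM or tree in practice.
glm_factors = {"A": 1.200, "B": 1.000, "C": 0.800}

def glm_rate(base_rate, variable_x):
    # GLM implementation: a simple lookup in the factor table
    return base_rate * glm_factors[variable_x]

def tree_tier(age, credit_score):
    # Decision-tree implementation: if/then conditions define the tiers
    if age < 25:
        return "tier 1"
    elif credit_score < 650:
        return "tier 2"
    else:
        return "tier 3"

print(glm_rate(100.0, "A"))   # 120.0
print(tree_tier(30, 700))     # tier 3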

38 Implementation
k-Nearest Neighbors
– Assign new data points according to the decision boundaries
– Supervised, but nonlinear and multidimensional, so assignments may not be intuitive and may seem arbitrary near the boundaries
– A new step in the rating algorithm
Ensembling
– Based on a learned composite set of rules to assign a prediction, which can be banded into a score-to-factor table, for example:
  Score        Factor
  1-100        1.200
  101-200      1.000
  …            …
  901-1000     0.800
– May be non-intuitive, but the influence of the variables can be determined
– Can significantly improve accuracy
– Calculation of confidence intervals for Random Forests is possible
– A new step in the rating algorithm

39 Testing & explaining impacts
GLMs
– Re-rate the in-force book and calculate the distribution of changes, as some DOIs require
– Examine the extremes, but it is unclear how to fix them: temper the selections (how?) or cap the changes?
– Multiple factors may change, but the impacts can be tested by hand
– Easy to explain to the customer
Decision Trees
– Impacts are few and discrete
– Easy to see and temper the extremes
– Could be large jumps to different segments
– May be difficult to explain
k-Means
– Re-groupings: simpler to determine the distribution of changes
– Easy to see and temper the transition to a new grouping
– Could be large jumps to different clusters
– May be difficult to explain

40 Testing & explaining impacts
k-Nearest Neighbors
– Could be large jumps to different classes for points near the border
– May be difficult to explain
Ensembling: Random Forests
– Provides a granular and gradual spectrum of indications, often grouped (deciles, etc.)
– Easy to see and temper the extremes
– May be difficult to explain: "natural" progressions could cause unexpected, undesirable, and compounded impacts

41 So what's the verdict?
Theoretical
– GLMs: a vast literature (pro); linear, with dependence on the exponential family of distributions: do they model real-world relationships? (con); feature selection, outliers, missing data, and multicollinearity are very important (con); underfitting and overfitting (con)
– Machine learning: nonlinear and nonparametric, so it could be more accurate (pro); less concern over feature selection, outliers, missing data, and multicollinearity (pro); some risk of overfitting, but this can be controlled (ok)
Practical
– GLMs: easier to interpret (pro); accepted by DOIs (pro); no changes to infrastructure (pro); easy to productionalize (pro); may not be as accurate, but may be the better choice if you can't take advantage of full machine learning indications (mixed)
– Machine learning: may be difficult to interpret (con); beginning to be accepted (remember the early stages of filing CAT models?) (mixed); most likely needs a new rating step (con); most likely needs a rules engine (con); may be more accurate if you can change your infrastructure to take advantage (mixed)
The practical issues are rapidly going away in favor of machine learning…

42 Scott Sobel, FCAS, MAAA, MSPA
Oliver Wyman Actuarial Consulting, Inc.
325 John H. McConnell Blvd, Columbus, OH 43215
phone: +1 614 227 6225
cell: +1 803 429 3153
scott.sobel@oliverwyman.com
www.oliverwyman.com
@ssobel2010

