1 School of Electrical, Electronic, and Computer Engineering, 홍기주, Advanced Computer Algorithms

2  If we build a model that decides whether a patient is healthy from height and weight alone, what problems arise? (limited features) ◦ Training variance  even a small change in the training set can change the predictions substantially ◦ Non-monotone effects  the ideal healthy weight lies in a bounded range; heavier is not always better, nor is lighter ◦ Linearly inseparable data  the classes cannot be separated by a linear boundary

3  Reducing training variance with bagging and random forests  Learning non-monotone relationships with generalized additive models  Increasing data separation with kernel methods  Modeling complex decision boundaries with support vector machines

4  Decision trees are an attractive method for a number of reasons: ◦ They take any type of data, numerical or categorical, without any distributional assumptions and without preprocessing. ◦ Most implementations (in particular, R’s) handle missing data; the method is also robust to redundant and nonlinear data. ◦ The algorithm is easy to use, and the output (the tree) is relatively easy to understand. ◦ Once the model is fit, scoring is fast.

5  On the other hand, decision trees do have some drawbacks: ◦ They have a tendency to overfit, especially without pruning ◦ They have high training variance: samples drawn from the same population can produce trees with different structures and different prediction accuracy ◦ Prediction accuracy can be low, compared to other methods Bagging and random forests are used to mitigate these drawbacks

6  Using bagging to improve prediction ◦ Data set  spamD.tsv (https://github.com/WinVector/zmPDSwR/raw/master/Spambase/spamD.tsv)

7  Using bagging to improve prediction ◦ Preparing Spambase data
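
A minimal sketch of this preparation step, assuming the spamD.tsv layout from the zmPDSwR repository (a spam label column plus an rgroup column used for splitting):

spamD <- read.table('spamD.tsv', header = TRUE, sep = '\t')    # load the Spambase data
spamTrain <- subset(spamD, spamD$rgroup >= 10)                 # roughly 90% for training
spamTest <- subset(spamD, spamD$rgroup < 10)                   # roughly 10% held out for test
spamVars <- setdiff(colnames(spamD), list('rgroup', 'spam'))   # the 57 feature columns
spamFormula <- as.formula(paste('spam == "spam"',              # TRUE exactly when a message is spam
                                paste(spamVars, collapse = ' + '), sep = ' ~ '))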

8  Using bagging to improve prediction ◦ Evaluating the performance of decision trees

9  Using bagging to improve prediction ◦ Evaluating the performance of decision trees The accuracy and F1 scores both degrade on the test set, and the deviance increases
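
A sketch of how this evaluation could be run; the helper functions loglikelihood() and accuracyMeasures() are illustrative names, not a packaged API:

library(rpart)
loglikelihood <- function(y, py) {                 # log likelihood of 0/1 outcomes y under probabilities py
  pysmooth <- ifelse(py == 0, 1e-12, ifelse(py == 1, 1 - 1e-12, py))
  sum(y * log(pysmooth) + (1 - y) * log(1 - pysmooth))
}
accuracyMeasures <- function(pred, truth, name = 'model') {
  dev.norm <- -2 * loglikelihood(as.numeric(truth), pred) / length(pred)  # normalized deviance
  ctable <- table(truth = truth, pred = (pred > 0.5))                     # confusion matrix at cut 0.5
  accuracy <- sum(diag(ctable)) / sum(ctable)
  precision <- ctable[2, 2] / sum(ctable[, 2])
  recall <- ctable[2, 2] / sum(ctable[2, ])
  f1 <- 2 * precision * recall / (precision + recall)
  data.frame(model = name, accuracy = accuracy, f1 = f1, dev.norm = dev.norm)
}
treemodel <- rpart(spamFormula, spamTrain)         # a single decision tree
accuracyMeasures(predict(treemodel, newdata = spamTrain),
                 spamTrain$spam == 'spam', name = 'tree, training')
accuracyMeasures(predict(treemodel, newdata = spamTest),
                 spamTest$spam == 'spam', name = 'tree, test')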

10  Using bagging to improve prediction ◦ Bagging decision trees

11  Using bagging to improve prediction ◦ Bagging decision trees Bagging improves accuracy and F1, and reduces deviance over both the training and test sets when compared to the single decision tree (less generalization error)
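
A sketch of bagging by hand, reusing spamFormula and accuracyMeasures() from the sketches above:

ntrain <- dim(spamTrain)[1]
n <- ntrain                                        # each bootstrap sample is full-size
ntree <- 100                                       # number of trees in the ensemble
samples <- sapply(1:ntree,                         # one column of resampled row indices per tree
                  FUN = function(iter) sample(1:ntrain, size = n, replace = TRUE))
treelist <- lapply(1:ntree,                        # fit one tree per bootstrap sample
                   FUN = function(iter) rpart(spamFormula, spamTrain[samples[, iter], ]))
predict.bag <- function(treelist, newdata) {       # average the individual tree predictions
  preds <- sapply(1:length(treelist),
                  FUN = function(iter) predict(treelist[[iter]], newdata = newdata))
  rowMeans(preds)
}
accuracyMeasures(predict.bag(treelist, newdata = spamTrain),
                 spamTrain$spam == 'spam', name = 'bagging, training')
accuracyMeasures(predict.bag(treelist, newdata = spamTest),
                 spamTest$spam == 'spam', name = 'bagging, test')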

12  Using random forests to further improve prediction ◦ Drawback of bagging  the feature sets used by the individual trees end up nearly identical What is a random forest? The random forest algorithm draws bootstrap samples, with replacement, from the training dataset. This is repeated N times to produce N bootstrap datasets, and when the decision tree algorithm is applied, m explanatory variables are selected at random at each node. By reducing the correlation between trees, the random forest method has the advantage of reducing variance compared to bagging.

15  Using random forests to further improve prediction ◦ Using random forests
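
A sketch using the randomForest package; the specific ntree, nodesize, and seed values are illustrative:

library(randomForest)
set.seed(5123512)                                  # arbitrary seed, for reproducibility
fmodel <- randomForest(x = spamTrain[, spamVars],
                       y = as.factor(spamTrain$spam),   # factor target: classification forest
                       ntree = 100,
                       nodesize = 7,
                       importance = TRUE)               # also record variable importance
accuracyMeasures(predict(fmodel, newdata = spamTrain[, spamVars], type = 'prob')[, 'spam'],
                 spamTrain$spam == 'spam', name = 'random forest, train')
accuracyMeasures(predict(fmodel, newdata = spamTest[, spamVars], type = 'prob')[, 'spam'],
                 spamTest$spam == 'spam', name = 'random forest, test')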

16  Using random forests to further improve prediction ◦ Report the model quality The random forest model performed dramatically better than the other two models in both training and test. But the random forest’s generalization error was comparable to that of a single decision tree (and almost twice that of the bagged model).

17  Using random forests to further improve prediction ◦ Examining variable importance  setting importance = TRUE in the call to randomForest() computes variable importance
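
A sketch of extracting and plotting the importances recorded by the fit above:

varImp <- importance(fmodel, type = 1)             # type 1: mean decrease in accuracy
head(varImp[order(varImp[, 1], decreasing = TRUE), , drop = FALSE], 10)  # ten most important
varImpPlot(fmodel, type = 1)                       # importance plot for all variables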

18  Using random forests to further improve prediction ◦ Examining variable importance By selecting only the important variables, smaller and faster trees can be built, and the selected variables can also be used with other modeling algorithms

19  Using random forests to further improve prediction ◦ Examining variable importance By selecting only the important variables, smaller and faster trees can be built, and the selected variables can also be used with other modeling algorithms

20  Using random forests to further improve prediction ◦ Fitting with fewer variables The smaller model performs just as well as the random forest model built using all 57 variables.
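
A sketch of refitting on only the most important variables; the cutoff of 25 is illustrative:

selVars <- names(sort(varImp[, 1], decreasing = TRUE))[1:25]  # top 25 variables by importance
fsel <- randomForest(x = spamTrain[, selVars],
                     y = as.factor(spamTrain$spam),
                     ntree = 100, nodesize = 7, importance = TRUE)
accuracyMeasures(predict(fsel, newdata = spamTest[, selVars], type = 'prob')[, 'spam'],
                 spamTest$spam == 'spam', name = 'RF small, test')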

21  Bagging and random forest takeaways ◦ Bagging stabilizes decision trees and improves accuracy by reducing variance. ◦ Bagging reduces generalization error. ◦ Random forests further improve decision tree performance by de- correlating the individual trees in the bagging ensemble. ◦ Random forests’ variable importance measures can help you determine which variables are contributing the most strongly to your model. ◦ Because the trees in a random forest ensemble are unpruned and potentially quite deep, there’s still a danger of overfitting. Be sure to evaluate the model on holdout data to get a better estimate of model performance.

22  Understanding GAMs ◦ For an underweight patient, gaining weight can improve health, but only up to a point (non-monotone)

23  A one-dimensional regression example ◦ Preparing an artificial problem
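
One plausible way to generate such data; the exact generating function and seed are illustrative, chosen to be consistent with the sin()/cos() description on the following slides:

set.seed(602957)
x <- rnorm(1000)
noise <- rnorm(1000, sd = 1.5)
y <- 3 * sin(2 * x) + cos(0.75 * x) - 1.5 * x^2 + noise   # nonlinear ground truth plus noise
select <- runif(1000)
frame <- data.frame(y = y, x = x)
train <- frame[select > 0.1, ]                     # roughly 90% training
test <- frame[select <= 0.1, ]                     # roughly 10% test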

24  A one-dimensional regression example ◦ Linear regression applied to our artificial example Because the data were generated with sin() and cos(), the relationship is not linear; the R-squared is very low, at 0.04
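
A sketch of the straight-line fit:

lin.model <- lm(y ~ x, data = train)               # linear fit to nonlinear data
summary(lin.model)                                 # the R-squared comes out very low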

25  A one-dimensional regression example ◦ Linear regression applied to our artificial example The current model's errors are heteroscedastic

26  A one-dimensional regression example ◦ GAM applied to our artificial example
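
A sketch of the same fit with mgcv's gam(); wrapping x in s() lets the model learn a spline:

library(mgcv)
glin.model <- gam(y ~ s(x), data = train)          # s(x): learn a smooth spline for x
glin.model$converged                               # should be TRUE
summary(glin.model)                                # edf > 1 signals a nonlinear fit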

27  A one-dimensional regression example ◦ GAM applied to our artificial example

28  A one-dimensional regression example ◦ GAM applied to our artificial example

29  A one-dimensional regression example ◦ GAM applied to our artificial example The GAM has been fit to be homoscedastic

30  A one-dimensional regression example ◦ Comparing linear regression and GAM performance The GAM performed similarly on both sets (RMSE of 1.40 on test versus 1.45 on training; R-squared of 0.78 on test versus 0.83 on training).
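
A sketch of how these numbers could be computed for both models on both sets:

perf <- function(model, frame) {                   # RMSE and R-squared on one data frame
  f <- predict(model, newdata = frame)
  c(rmse = sqrt(mean((frame$y - f)^2)),
    rsq = 1 - sum((frame$y - f)^2) / sum((frame$y - mean(frame$y))^2))
}
perf(lin.model, train); perf(lin.model, test)
perf(glin.model, train); perf(glin.model, test)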

31  Extracting the nonlinear relationships ◦ Extracting a learned spline from a GAM
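
A sketch of pulling the learned s(x) out of the model and plotting it against the data:

sx <- predict(glin.model, type = 'terms')          # per-term contributions; here just s(x)
xframe <- cbind(train, sx = sx[, 1])
library(ggplot2)
ggplot(xframe, aes(x = x)) +
  geom_point(aes(y = y), alpha = 0.4) +            # the raw training data
  geom_line(aes(y = sx))                           # the learned spline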

32  Using GAM on actual data ◦ Applying linear regression (with and without GAM) to health data Dataset  CDC 2010 natality dataset (https://github.com/WinVector/zmPDSwR/blob/master/CDC/NatalBirthData.rData) Goal  predict a newborn's birth weight from the given data Independent variables  mother's weight (PWGT), mother's pregnancy weight gain (WTGAIN), mother's age (MAGER), number of prenatal medical visits (UPREVIS)
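
A sketch of both fits; it assumes the .rData file provides a data frame named sdata with an ORIGRANDGROUP column for splitting, following the conventions of the other zmPDSwR datasets:

load('NatalBirthData.rData')                       # assumed to provide 'sdata'
train <- sdata[sdata$ORIGRANDGROUP <= 5, ]         # assumed train/test split column
test <- sdata[sdata$ORIGRANDGROUP > 5, ]
linmodel <- lm(DBWT ~ PWGT + WTGAIN + MAGER + UPREVIS, data = train)      # plain linear regression
library(mgcv)
glinmodel <- gam(DBWT ~ s(PWGT) + s(WTGAIN) + s(MAGER) + s(UPREVIS),      # GAM: one spline per variable
                 data = train)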

33  Using GAM on actual data ◦ Applying linear regression (with and without GAM) to health data

34  Using GAM on actual data ◦ Applying linear regression (with and without GAM) to health data

35  Using GAM on actual data ◦ Applying linear regression (with and without GAM) to health data

36  Using GAM on actual data ◦ Applying linear regression (with and without GAM) to health data Since the edf is greater than 1 for each variable, all four variables can be said to have a nonlinear relationship with the outcome

37  Using GAM on actual data ◦ Plotting GAM results
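
A sketch using mgcv's built-in term plots:

plot(glinmodel, pages = 1, scale = 0)              # one panel per s() term, free y-scales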

38  Using GAM on actual data ◦ Plotting GAM results The shapes of the s() splines and the smoothed curves are similar

39  Using GAM on actual data ◦ Checking GAM model performance on hold-out data Performance differs little from the training set, so the model is not badly overfit

40  Using GAM for logistic regression ◦ GLM logistic regression ◦ GAM logistic regression Predict low birth weight (DBWT < 2000)
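
A sketch of the two logistic fits on the same training frame as above:

logmod <- glm(DBWT < 2000 ~ PWGT + WTGAIN + MAGER + UPREVIS,
              data = train, family = binomial(link = 'logit'))            # GLM baseline
glogmod <- gam(DBWT < 2000 ~ s(PWGT) + s(WTGAIN) + s(MAGER) + s(UPREVIS),
               data = train, family = binomial(link = 'logit'))           # GAM variant
summary(glogmod)                                   # check edf and significance per term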

41  Using GAM for logistic regression ◦ GAM logistic regression

42  GAM takeaways ◦ GAMs let you represent nonlinear and non-monotonic relationships between variables and outcome in a linear or logistic regression framework. ◦ In the mgcv package, you can extract the discovered relationship from the GAM model using the predict() function with the type="terms" parameter. ◦ You can evaluate the GAM with the same measures you’d use for standard linear or logistic regression: residuals, deviance, R- squared, and pseudo R-squared. The gam() summary also gives you an indication of which variables have a significant effect on the model. ◦ Because GAMs have increased complexity compared to standard linear or logistic regression models, there’s more risk of overfit.

43  Synthetic variables? ◦ When the available variables make it hard to build a good model and new variables are wanted, new variables can be constructed by combining the existing data; these are called synthetic variables ◦ Kernel methods are used to create such new variables and improve machine learning performance

44  Kernel method ◦ When the available variables make it hard to build a good model and new variables are wanted, new variables can be constructed by combining the existing data; these are called synthetic variables ◦ Kernel methods are used to create such new variables and improve machine learning performance

45  Understanding kernel functions ◦ An artificial kernel example k(u,v) = phi(u) %*% phi(v)
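
A small worked example of this identity: the kernel evaluated directly on u and v agrees with the dot product of the explicitly transformed vectors:

u <- c(1, 2)
v <- c(3, 4)
k <- function(u, v) {                              # quadratic kernel written out directly
  u[1] * v[1] + u[2] * v[2] +
    u[1] * u[1] * v[1] * v[1] + u[2] * u[2] * v[2] * v[2] +
    u[1] * u[2] * v[1] * v[2]
}
phi <- function(x) {                               # the matching explicit feature map
  x <- as.numeric(x)
  c(x, x * x, combn(x, 2, FUN = prod))             # originals, squares, pairwise products
}
print(k(u, v))                                     # 108
print(phi(u) %*% phi(v))                           # also 108: k(u, v) = phi(u) . phi(v)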

46  Understanding kernel functions The goal is to use a kernel transformation to make the data linearly separable

47  Using an explicit kernel on a problem ◦ Applying stepwise linear regression to PUMS data (https://github.com/WinVector/zmPDSwR/raw/master/PUMS/psub.RData)
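
A sketch of the baseline stepwise fit; it assumes psub.RData provides a data frame psub with the usual PUMS columns (AGEP, SEX, COW, SCHL, PINCP) and an ORIGRANDGROUP split column:

load('psub.RData')                                 # assumed to provide 'psub'
dtrain <- subset(psub, ORIGRANDGROUP >= 500)       # assumed train/test split
dtest <- subset(psub, ORIGRANDGROUP < 500)
m1 <- step(lm(log(PINCP, base = 10) ~ AGEP + SEX + COW + SCHL, data = dtrain),
           direction = 'both')                     # stepwise variable selection
rmse <- function(y, f) sqrt(mean((y - f)^2))
rmse(log(dtest$PINCP, base = 10), predict(m1, newdata = dtest))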

48  Using an explicit kernel on a problem ◦ Applying an example explicit kernel transform phi() is used to create the new modeling variables
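
A sketch of the lift, reusing phi() from the kernel example; the generated column names here are generic placeholders, whereas the original listing derives readable names such as AGEP_AGEP:

mm <- model.matrix(~ 0 + AGEP + SEX + COW + SCHL, psub)   # factors expand to indicators
pM <- as.data.frame(t(apply(mm, 1, phi)))          # one lifted row per example
colnames(pM) <- paste0('phi', seq_len(ncol(pM)))   # placeholder feature names
pM$logPINCP <- log(psub$PINCP, base = 10)
pMtrain <- pM[psub$ORIGRANDGROUP >= 500, ]
pMtest <- pM[psub$ORIGRANDGROUP < 500, ]
kmodel <- lm(logPINCP ~ ., data = pMtrain)         # linear model on the lifted features
rmse(pMtest$logPINCP, predict(kmodel, newdata = pMtest))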

49  Using an explicit kernel on a problem ◦ Applying an example explicit kernel transform

50  Using an explicit kernel on a problem ◦ Modeling using the explicit kernel transform The RMSE improves slightly

51  Using an explicit kernel on a problem ◦ Inspecting the results of the explicit kernel model A new variable, AGEP_AGEP, captures the non-monotone relationship between age and log income

52  Kernel takeaways ◦ Kernels provide a systematic way of creating interactions and other synthetic variables that are combinations of individual variables ◦ The goal of kernelizing is to lift the data into a space where the data is separable, or where linear methods can be used directly

53  Understanding support vector machines The problem: lift linearly inseparable data (left) into a higher-dimensional kernel space (right), then find a hyperplane that linearly separates the data

54  Trying an SVM on artificial example data ◦ Setting up the spirals data as an example classification problem
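
A sketch of the setup using kernlab's built-in spirals data; spectral clustering (specc) recovers the two spirals, and its labels serve as the classes:

library(kernlab)
data(spirals)                                      # 2-D points forming two interleaved spirals
sc <- specc(spirals, centers = 2)                  # spectral clustering finds the two arms
s <- data.frame(x = spirals[, 1], y = spirals[, 2],
                class = as.factor(sc))             # cluster labels become class labels
set.seed(2335246L)                                 # illustrative seed
s$group <- sample.int(100, size = nrow(s), replace = TRUE)
sTrain <- subset(s, group > 10)                    # roughly 90% training
sTest <- subset(s, group <= 10)                    # roughly 10% test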

55  Trying an SVM on artificial example data ◦ Setting up the spirals data as an example classification problem

56  SUPPORT VECTOR MACHINES WITH THE WRONG KERNEL ◦ SVM with a poor choice of kernel  what happens when the kernel is chosen poorly

57  SUPPORT VECTOR MACHINES WITH THE WRONG KERNEL ◦ SVM with a poor choice of kernel

58  SUPPORT VECTOR MACHINES WITH A GOOD KERNEL ◦ SVM with a good choice of kernel
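
A sketch of the two fits side by side: kernlab's vanilladot (linear) kernel cannot separate the spirals, while rbfdot (Gaussian) can:

mSVMV <- ksvm(class ~ x + y, data = sTrain, kernel = 'vanilladot')   # poor choice: linear kernel
sTest$predSVMV <- predict(mSVMV, newdata = sTest, type = 'response')
mSVMG <- ksvm(class ~ x + y, data = sTrain, kernel = 'rbfdot')       # good choice: Gaussian kernel
sTest$predSVMG <- predict(mSVMG, newdata = sTest, type = 'response')
table(truth = sTest$class, linear = sTest$predSVMV)                  # many mistakes
table(truth = sTest$class, gaussian = sTest$predSVMG)                # nearly clean separation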

59  Using SVMs on real data ◦ Revisiting the Spambase example with GLM
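
A sketch of the GLM baseline, reusing spamTrain, spamTest, and spamFormula from the bagging sketches:

spamModel <- glm(spamFormula, family = binomial(link = 'logit'), data = spamTrain)
spamTest$pred <- predict(spamModel, newdata = spamTest, type = 'response')
table(y = spamTest$spam, glPred = spamTest$pred > 0.5)    # confusion matrix at the 0.5 cut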

60  Using SVMs on real data ◦ Applying an SVM to the Spambase example
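
A sketch of the SVM fit; the C value and class weights are illustrative choices that make false positives (misclassified non-spam) more expensive:

library(kernlab)
spamTrain$spam <- as.factor(spamTrain$spam)        # ksvm wants a factor target
spamTest$spam <- as.factor(spamTest$spam)
spamFormulaV <- as.formula(paste('spam', paste(spamVars, collapse = ' + '), sep = ' ~ '))
svmM <- ksvm(spamFormulaV, data = spamTrain,
             kernel = 'rbfdot',                    # Gaussian kernel
             C = 10,                               # soft-margin penalty
             prob.model = TRUE, cross = 5,
             class.weights = c('spam' = 1, 'non-spam' = 10))  # costly to misclassify non-spam
spamTest$svmPred <- predict(svmM, newdata = spamTest, type = 'response')
table(y = spamTest$spam, svmPred = spamTest$svmPred)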

61  Using SVMs on real data ◦ Printing the SVM results summary
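
A sketch: printing a ksvm fit reports the kernel, the number of support vectors, the training error, and the cross-validation error:

print(svmM)                                        # summary of the fitted SVM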

62  COMPARING RESULTS ◦ Shifting the decision point to perform an apples-to-apples comparison The SVM flagged 162 test messages as spam (including its false positives), so the GLM threshold is adjusted so that it also flags 162 messages
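
A sketch of the shift: pick the GLM score cut that flags the same 162 messages as the SVM did:

sameCut <- sort(spamTest$pred)[length(spamTest$pred) - 162]   # the 162nd-highest GLM score
table(y = spamTest$spam, glPred = spamTest$pred > sameCut)    # now comparable to the SVM table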

63  Support vector machine takeaways ◦ SVMs are a kernel-based classification approach where the kernels are represented in terms of a (possibly very large) subset of the training examples. ◦ SVMs try to lift the problem into a space where the data is linearly separable (or as near to separable as possible). ◦ SVMs are useful in cases where the useful interactions or other combinations of input variables aren’t known in advance. They’re also useful when similarity is strong evidence of belonging to the same class.

64  Bagging and random forests—To reduce the sensitivity of models to early modeling choices and reduce modeling variance  Generalized additive models—To remove the (false) assumption that each model feature contributes to the model in a monotone fashion  Kernel methods—To introduce new features that are nonlinear combinations of existing features, increasing the power of our model  Support vector machines—To use training examples as landmarks (support vectors), again increasing the power of our model

65 END

